thomwolf (HF staff) committed
Commit 5dfedb3 · 1 Parent(s): 257496d
assets/images/5D_nutshell_tp_sp.svg ADDED
assets/images/5d_nutshell_cp.svg ADDED
assets/images/5d_nutshell_ep.svg ADDED
dist/assets/images/5D_nutshell_tp_sp.svg ADDED
dist/assets/images/5d_nutshell_cp.svg ADDED
dist/assets/images/5d_nutshell_ep.svg ADDED
dist/index.html CHANGED
@@ -1508,7 +1508,8 @@
 
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: First, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point as communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" /></div>
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
  <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
@@ -1518,7 +1519,9 @@
 
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" /></div>
+
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
@@ -1531,14 +1534,15 @@
  <li>Expert Parallelism primarily affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
  </ul>
 
+ <div class="l-page"><img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" /></div>
+
  <div class="note-box">
- <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
- </div>
+ <p class="note-box-title">📝 Note</p>
+ <p class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
+ </p>
+ </div>
 
- <p>TODO: the text between the table and figures is still a bit sparse.</p>
 
  <table>
  <thead>
@@ -2576,13 +2580,26 @@
  year={2025},
  }</pre>
  </d-appendix>
+ <script>
+ function toggleTOC() {
+ const content = document.querySelector('.toc-content');
+ const icon = document.querySelector('.toggle-icon');
+
+ content.classList.toggle('collapsed');
+ icon.classList.toggle('collapsed');
+ }
+ </script>
 
  <script>
  const article = document.querySelector('d-article');
  const toc = document.querySelector('d-contents');
  if (toc) {
  const headings = article.querySelectorAll('h2, h3, h4');
- let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ // let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ let ToC = `<nav role="navigation" class="l-text figcaption"><div class="toc-header" onclick="toggleTOC()">
+ <span class="toc-title">Table of Contents</span>
+ <span class="toggle-icon">▼</span>
+ </div><div class="toc-content">`;
  let prevLevel = 0;
 
  for (const el of headings) {
@@ -2604,7 +2621,7 @@
  }
  if (level === 0)
  ToC += '<div>' + link + '</div>';
- else
+ else if (level === 1)
  ToC += '<li>' + link + '</li>';
  }
 
@@ -2612,10 +2629,10 @@
  ToC += '</ul>'
  prevLevel--;
  }
- ToC += '</nav>';
+ ToC += '</div></nav>';
  toc.innerHTML = ToC;
  toc.setAttribute('prerendered', 'true');
- const toc_links = document.querySelectorAll('d-contents > nav a');
+ const toc_links = document.querySelectorAll('d-contents > nav div a');
 
  window.addEventListener('scroll', (_event) => {
  if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
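For readers skimming the diff, the property the Tensor Parallelism paragraph above leans on is easy to demonstrate. The following is a minimal sketch (illustrative only, not part of this commit), assuming PyTorch and a single process standing in for the tensor-parallel ranks: splitting the weight matrix along its columns, computing each shard independently, and concatenating the partial outputs reproduces the unsharded matmul.

# Sketch of the distributive property behind tensor parallelism (toy shapes, one process).
import torch

batch, d_in, d_out, tp = 4, 8, 16, 2              # tp = hypothetical number of tensor-parallel shards
x = torch.randn(batch, d_in)                      # activations, replicated on every rank
w = torch.randn(d_in, d_out)                      # full weight matrix

w_shards = torch.chunk(w, tp, dim=1)              # column-parallel split: each rank holds d_out/tp columns
partial_outputs = [x @ w_i for w_i in w_shards]   # each rank computes its slice independently
y_tp = torch.cat(partial_outputs, dim=1)          # gather along the hidden dimension

assert torch.allclose(y_tp, x @ w, atol=1e-6)     # identical to the unsharded matmul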
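Similarly, the ring attention pattern mentioned in the Context Parallelism paragraph can be sketched in a few lines. This is a single-process simulation with toy shapes and `cp` hypothetical ranks: each rank keeps its query shard while key/value shards rotate around the ring, and partial softmax statistics are accumulated so no rank ever holds the full sequence. Real implementations additionally overlap the ring hops with compute and use a numerically stable online softmax.

# Sketch of ring-style attention over a sequence sharded across cp simulated ranks.
import torch

seq, dim, cp = 8, 16, 4                           # cp = hypothetical context-parallel degree
q, k, v = torch.randn(seq, dim), torch.randn(seq, dim), torch.randn(seq, dim)

q_shards = list(torch.chunk(q, cp, dim=0))        # each rank owns seq/cp query rows
kv_shards = list(zip(torch.chunk(k, cp, dim=0), torch.chunk(v, cp, dim=0)))

outputs = []
for rank in range(cp):
    q_i = q_shards[rank]
    numer = torch.zeros(q_i.shape[0], dim)        # running softmax numerator
    denom = torch.zeros(q_i.shape[0], 1)          # running softmax denominator
    for step in range(cp):                        # k/v blocks arrive one ring hop at a time
        k_j, v_j = kv_shards[(rank + step) % cp]
        scores = torch.exp(q_i @ k_j.T / dim ** 0.5)
        numer += scores @ v_j
        denom += scores.sum(dim=1, keepdim=True)
    outputs.append(numer / denom)

full = torch.softmax(q @ k.T / dim ** 0.5, dim=-1) @ v   # reference: full (non-causal) attention
assert torch.allclose(torch.cat(outputs), full, atol=1e-5)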
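And the all-to-all dispatch/combine that the Expert Parallelism paragraph describes reduces, on a single device, to grouping tokens by their routed expert and scattering the results back to their original positions. The sketch below uses a hypothetical top-1 gate and toy expert MLPs; in an actual EP setup each expert (each iteration of the loop) would live on its own rank and the grouping would be performed with all-to-all collectives.

# Sketch of top-1 expert routing of the kind expert parallelism distributes across ranks.
import torch
import torch.nn as nn

tokens, dim, n_experts = 10, 16, 4
x = torch.randn(tokens, dim)
router = nn.Linear(dim, n_experts)                # learned gate producing expert scores
experts = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                         for _ in range(n_experts)])

expert_ids = router(x).argmax(dim=-1)             # top-1 routing decision per token
output = torch.zeros_like(x)
for e in range(n_experts):                        # in EP this loop body runs on the rank holding expert e
    mask = expert_ids == e                        # dispatch: tokens grouped by assigned expert
    if mask.any():
        output[mask] = experts[e](x[mask])        # combine: results scattered back to token order

print(torch.bincount(expert_ids, minlength=n_experts))   # how many tokens each expert received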
dist/style.css CHANGED
@@ -150,6 +150,7 @@ d-contents > nav a.active {
  @media (max-width: 1199px) {
  d-contents {
  display: none;
+ background: white;
  justify-self: start;
  align-self: start;
  padding-bottom: 0.5em;
@@ -160,7 +161,7 @@ d-contents > nav a.active {
  border-bottom-style: solid;
  border-bottom-color: rgba(0, 0, 0, 0.1);
  overflow-y: scroll;
- height: calc(100vh - 80px);
+ height: calc(100vh - 40px);
  scrollbar-width: none;
  z-index: -100;
  }
@@ -170,6 +171,31 @@ d-contents a:hover {
  border-bottom: none;
  }
 
+ toc-title {
+ font-weight: bold;
+ font-size: 1.2em;
+ color: #333;
+ }
+
+ toggle-icon {
+ transition: transform 0.3s;
+ }
+
+ toggle-icon.collapsed {
+ transform: rotate(-90deg);
+ }
+
+ .toc-content {
+ margin-top: 15px;
+ overflow: hidden;
+ max-height: 1000px;
+ transition: max-height 0.3s ease-out;
+ }
+
+ .toc-content.collapsed {
+ max-height: 0;
+ margin-top: 0;
+ }
 
  @media (min-width: 1200px) {
  d-article {
@@ -179,6 +205,7 @@ d-contents a:hover {
 
  d-contents {
  align-self: start;
+ background: white;
  grid-column-start: 1 !important;
  grid-column-end: 4 !important;
  grid-row: auto / span 6;
@@ -186,16 +213,17 @@ d-contents a:hover {
  margin-top: 0em;
  padding-right: 3em;
  padding-left: 2em;
- border-right: 1px solid rgba(0, 0, 0, 0.1);
+ /* border-right: 1px solid rgba(0, 0, 0, 0.1);
  border-right-width: 1px;
  border-right-style: solid;
- border-right-color: rgba(0, 0, 0, 0.1);
+ border-right-color: rgba(0, 0, 0, 0.1); */
  position: -webkit-sticky; /* For Safari */
  position: sticky;
  top: 10px; /* Adjust this value if needed */
- overflow-y: scroll;
- height: calc(100vh - 80px);
+ overflow-y: auto;
+ height: calc(100vh - 40px);
  scrollbar-width: none;
+ transition: max-height 0.3s ease-out;
  z-index: -100;
  }
  }
@@ -205,7 +233,7 @@ d-contents nav h3 {
  margin-bottom: 1em;
  }
 
- d-contents nav div {
+ d-contents nav div div {
  color: rgba(0, 0, 0, 0.8);
  font-weight: bold;
  }
src/index.html CHANGED
@@ -1508,7 +1508,8 @@
 
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: First, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point as communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" /></div>
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
  <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
@@ -1518,7 +1519,9 @@
 
  <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" /></div>
+
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
  <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
@@ -1531,14 +1534,15 @@
  <li>Expert Parallelism primarily affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
  </ul>
 
+ <div class="l-page"><img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" /></div>
+
  <div class="note-box">
- <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
- </div>
+ <p class="note-box-title">📝 Note</p>
+ <p class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
+ </p>
+ </div>
 
- <p>TODO: the text between the table and figures is still a bit sparse.</p>
 
  <table>
  <thead>
@@ -2576,13 +2580,26 @@
  year={2025},
  }</pre>
  </d-appendix>
+ <script>
+ function toggleTOC() {
+ const content = document.querySelector('.toc-content');
+ const icon = document.querySelector('.toggle-icon');
+
+ content.classList.toggle('collapsed');
+ icon.classList.toggle('collapsed');
+ }
+ </script>
 
  <script>
  const article = document.querySelector('d-article');
  const toc = document.querySelector('d-contents');
  if (toc) {
  const headings = article.querySelectorAll('h2, h3, h4');
- let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ // let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ let ToC = `<nav role="navigation" class="l-text figcaption"><div class="toc-header" onclick="toggleTOC()">
+ <span class="toc-title">Table of Contents</span>
+ <span class="toggle-icon">▼</span>
+ </div><div class="toc-content">`;
  let prevLevel = 0;
 
  for (const el of headings) {
@@ -2604,7 +2621,7 @@
  }
  if (level === 0)
  ToC += '<div>' + link + '</div>';
- else
+ else if (level === 1)
  ToC += '<li>' + link + '</li>';
  }
 
@@ -2612,10 +2629,10 @@
  ToC += '</ul>'
  prevLevel--;
  }
- ToC += '</nav>';
+ ToC += '</div></nav>';
  toc.innerHTML = ToC;
  toc.setAttribute('prerendered', 'true');
- const toc_links = document.querySelectorAll('d-contents > nav a');
+ const toc_links = document.querySelectorAll('d-contents > nav div a');
 
  window.addEventListener('scroll', (_event) => {
  if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
src/style.css CHANGED
@@ -150,6 +150,7 @@ d-contents > nav a.active {
  @media (max-width: 1199px) {
  d-contents {
  display: none;
+ background: white;
  justify-self: start;
  align-self: start;
  padding-bottom: 0.5em;
@@ -160,7 +161,7 @@ d-contents > nav a.active {
  border-bottom-style: solid;
  border-bottom-color: rgba(0, 0, 0, 0.1);
  overflow-y: scroll;
- height: calc(100vh - 80px);
+ height: calc(100vh - 40px);
  scrollbar-width: none;
  z-index: -100;
  }
@@ -170,6 +171,31 @@ d-contents a:hover {
  border-bottom: none;
  }
 
+ toc-title {
+ font-weight: bold;
+ font-size: 1.2em;
+ color: #333;
+ }
+
+ toggle-icon {
+ transition: transform 0.3s;
+ }
+
+ toggle-icon.collapsed {
+ transform: rotate(-90deg);
+ }
+
+ .toc-content {
+ margin-top: 15px;
+ overflow: hidden;
+ max-height: 1000px;
+ transition: max-height 0.3s ease-out;
+ }
+
+ .toc-content.collapsed {
+ max-height: 0;
+ margin-top: 0;
+ }
 
  @media (min-width: 1200px) {
  d-article {
@@ -179,6 +205,7 @@ d-contents a:hover {
 
  d-contents {
  align-self: start;
+ background: white;
  grid-column-start: 1 !important;
  grid-column-end: 4 !important;
  grid-row: auto / span 6;
@@ -186,16 +213,17 @@ d-contents a:hover {
  margin-top: 0em;
  padding-right: 3em;
  padding-left: 2em;
- border-right: 1px solid rgba(0, 0, 0, 0.1);
+ /* border-right: 1px solid rgba(0, 0, 0, 0.1);
  border-right-width: 1px;
  border-right-style: solid;
- border-right-color: rgba(0, 0, 0, 0.1);
+ border-right-color: rgba(0, 0, 0, 0.1); */
  position: -webkit-sticky; /* For Safari */
  position: sticky;
  top: 10px; /* Adjust this value if needed */
- overflow-y: scroll;
- height: calc(100vh - 80px);
+ overflow-y: auto;
+ height: calc(100vh - 40px);
  scrollbar-width: none;
+ transition: max-height 0.3s ease-out;
  z-index: -100;
  }
  }
@@ -205,7 +233,7 @@ d-contents nav h3 {
  margin-bottom: 1em;
  }
 
- d-contents nav div {
+ d-contents nav div div {
  color: rgba(0, 0, 0, 0.8);
  font-weight: bold;
  }