datenwerkzeuge/CDL-Webscraping-Workshop-2025

bsenst committed on Jan 19

Commit eb779a0
Parent: c4862fa

add glossary, remove extension

Files changed (8)

src/.gitignore +2 -1
src/01_setup/glossar.qmd +15 -6
src/_extensions/shafayetShafee/downloadthis/_extension.yml +0 -8
src/_extensions/shafayetShafee/downloadthis/downloadthis.lua +0 -121
src/_extensions/shafayetShafee/downloadthis/puremagic.lua +0 -735
src/_extensions/shafayetShafee/downloadthis/resources/css/downloadthis.css +0 -13
src/_quarto.yml +2 -2
src/index.qmd +1 -1

src/.gitignore CHANGED

@@ -1,2 +1,3 @@
 /.quarto/
-pdf

 /.quarto/
+pdf
+_extensions

src/01_setup/glossar.qmd CHANGED

@@ -1,6 +1,5 @@
 In the world of the internet and computer science, there are many terms and concepts that are relevant to understanding and carrying out web scraping projects.
 ## A as in Asynchronous Scraping
 * AJAX (Asynchronous JavaScript and XML): A group of web development techniques used to build asynchronous web applications, meaning that web pages can be updated dynamically without a full reload.
 * API (Application Programming Interface): A set of definitions and protocols that allow software applications to communicate with each other. Often used to extract data legally and efficiently.
@@ -19,17 +18,17 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 ## D as in DOM
 * Data Pipeline: A process or system for collecting, transforming, and loading data from a source to a destination, often in a web scraping context.
-* This glossary covers some of the basic terms used in web scraping. There are of course many more specialized terms and tools, but this should provide a good overview.
 * DOM (Document Object Model): A programming interface for HTML and XML documents that provides a structured representation of the document so that it can be manipulated and searched.
 * DOM Manipulation: The act of changing or inspecting a web page's Document Object Model in order to extract data or simulate interactions.
 * Dynamic Content: Content on a web page that is generated or modified by JavaScript only after the page has loaded; it often requires special scraping techniques.
 ## E as in Element
 * Elastic IP: A static IP address used in cloud computing environments that can be assigned to instances as needed; useful for long-running scraping projects.
-* Element: A single part of an HTML or XML document, such as a <div> or <p> tag, from which data can be extracted.
 * Encoding: Refers to how data is represented and interpreted in a web scraping context, e.g. the character encoding UTF-8.
 ## F as in Fingerprinting
 * Fiddler: A web debugger that can be used to monitor HTTP traffic, which can be helpful when debugging web scraping scripts.
 * Fingerprinting: The process of recognizing and identifying requests by analyzing browser and user characteristics in order to apply anti-scraping measures.
 * Frame Handling: Dealing with frames or iframes on web pages, which contain separate HTML documents that have to be scraped separately.
@@ -41,7 +40,6 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 ## H as in Headless
 * Headless Browser: A web browser without a graphical user interface that is controlled by scripts and renders web pages like a normal browser, just without a visible window.
-* Here are some additional terms that can be included in the web scraping glossary:
 * Honeypot: A trap placed by websites to detect and block automated scraping or hacking attempts.
 * HTML5: The current major version of HTML, which introduced many new elements and attributes that must be taken into account when scraping.
 * HTTP (Hypertext Transfer Protocol): The protocol used to transfer web pages and data over the internet.
@@ -88,6 +86,7 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 * Page Object Model: A design pattern in automation that represents web pages as objects with specific methods and properties.
 * Pagination: A technique for distributing content across multiple pages instead of showing everything on a single page. It is used to manage the display of large amounts of data by splitting the results into smaller, manageable parts. Many websites use a consistent pattern in their URLs to link to the individual pages (e.g. ?page=2). Some sites load content dynamically, which means a scraper may have to execute JavaScript to load the next page of content.
 * Parser: A program that converts one structure (such as HTML) into another form that is suitable for processing.
 * PhantomJS: Was a scriptable headless WebKit browser that was used for web scraping; it has not been developed further since 2018.
 * Proxy: A server that acts as an intermediary between a client and the internet. Can be used to mask requests or to conceal the scraper's location.
 * Puppeteer: A Node.js library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol; often used for web scraping.
@@ -100,7 +99,10 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 * Regex (Regular Expressions): Powerful search patterns that can be used to find and extract specific text patterns in web pages.
 * Request Headers: Metadata sent with every HTTP request; they can be modified so that the request appears to come from a legitimate user.
 * Request: A module in Python that makes it possible to send HTTP requests (for example the `requests` library). Often used in connection with web scraping.
 * Robots.txt: A file used by website operators to define which parts of their website may be crawled by bots (such as web scrapers).
 ## S as in Selenium
 * Scraper: A script or program that extracts data from websites.
@@ -108,6 +110,13 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 * Selenium: A tool used mainly for testing web applications, but also for web scraping, because its browser automation can handle JavaScript-heavy pages.
 * Spider: A specific term for a program or module that navigates through web pages and collects data, often used in connection with frameworks such as Scrapy.
 * Splash: A JavaScript rendering service that is often used with Scrapy to render dynamic web pages.
 ## T as in Throttling
 * Text Mining: The process of extracting useful knowledge from text, often after scraping text content.
@@ -138,6 +147,6 @@ In der Welt des Internets und der Computerwissenschaften gibt es eine Vielzahl v
 ## Y as in Yield
 * Yield: A Python keyword used in generator functions to return a result and pause execution until the next value is requested; useful in web scraping for working efficiently with large amounts of data.
-## Z as in
 * Zope: A Python web framework that is sometimes mentioned in connection with web scraping because it offers tools and libraries that can also be useful in scraping projects.
-* Zyte (formerly Scrapinghub) is a leading company in the field of web data extraction and offers a variety of tools and services that simplify and scale web scraping. Here are some of the main aspects of Zyte:

 In the world of the internet and computer science, there are many terms and concepts that are relevant to understanding and carrying out web scraping projects.
 ## A as in Asynchronous Scraping
 * AJAX (Asynchronous JavaScript and XML): A group of web development techniques used to build asynchronous web applications, meaning that web pages can be updated dynamically without a full reload.
 * API (Application Programming Interface): A set of definitions and protocols that allow software applications to communicate with each other. Often used to extract data legally and efficiently.
 ## D as in DOM
 * Data Pipeline: A process or system for collecting, transforming, and loading data from a source to a destination, often in a web scraping context.
 * DOM (Document Object Model): A programming interface for HTML and XML documents that provides a structured representation of the document so that it can be manipulated and searched.
 * DOM Manipulation: The act of changing or inspecting a web page's Document Object Model in order to extract data or simulate interactions.
 * Dynamic Content: Content on a web page that is generated or modified by JavaScript only after the page has loaded; it often requires special scraping techniques.
 ## E as in Element
 * Elastic IP: A static IP address used in cloud computing environments that can be assigned to instances as needed; useful for long-running scraping projects.
+* Element: A single part of an HTML or XML document, such as a `<div>` or `<p>` tag, from which data can be extracted.
 * Encoding: Refers to how data is represented and interpreted in a web scraping context, e.g. the character encoding UTF-8.
 ## F as in Fingerprinting
+* Feed refers, in web technology, to a source of regularly updated content that is presented in a standardized format such as RSS, Atom, or JSON Feed.
 * Fiddler: A web debugger that can be used to monitor HTTP traffic, which can be helpful when debugging web scraping scripts.
 * Fingerprinting: The process of recognizing and identifying requests by analyzing browser and user characteristics in order to apply anti-scraping measures.
 * Frame Handling: Dealing with frames or iframes on web pages, which contain separate HTML documents that have to be scraped separately.
 ## H as in Headless
 * Headless Browser: A web browser without a graphical user interface that is controlled by scripts and renders web pages like a normal browser, just without a visible window.
 * Honeypot: A trap placed by websites to detect and block automated scraping or hacking attempts.
 * HTML5: The current major version of HTML, which introduced many new elements and attributes that must be taken into account when scraping.
 * HTTP (Hypertext Transfer Protocol): The protocol used to transfer web pages and data over the internet.
 * Page Object Model: A design pattern in automation that represents web pages as objects with specific methods and properties.
 * Pagination: A technique for distributing content across multiple pages instead of showing everything on a single page. It is used to manage the display of large amounts of data by splitting the results into smaller, manageable parts. Many websites use a consistent pattern in their URLs to link to the individual pages (e.g. ?page=2). Some sites load content dynamically, which means a scraper may have to execute JavaScript to load the next page of content.
 * Parser: A program that converts one structure (such as HTML) into another form that is suitable for processing.
+* Parsing is the process of analyzing and interpreting data structures or text in order to convert them into another form that is often easier to process.
 * PhantomJS: Was a scriptable headless WebKit browser that was used for web scraping; it has not been developed further since 2018.
 * Proxy: A server that acts as an intermediary between a client and the internet. Can be used to mask requests or to conceal the scraper's location.
 * Puppeteer: A Node.js library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol; often used for web scraping.
 * Regex (Regular Expressions): Powerful search patterns that can be used to find and extract specific text patterns in web pages.
 * Request Headers: Metadata sent with every HTTP request; they can be modified so that the request appears to come from a legitimate user.
 * Request: A module in Python that makes it possible to send HTTP requests (for example the `requests` library). Often used in connection with web scraping.
+* Request parameters are data passed to the server in an HTTP request to provide additional information or to specify the request more precisely. They can form the part of the URL after a `?`, e.g. `?id=123&name=test`.
+* REST API (Representational State Transfer Application Programming Interface) is an API that follows REST, an architectural style for designing networked applications. It uses HTTP methods for CRUD operations (Create, Read, Update, Delete) and is designed to be stateless, meaning that every request contains all the information the server needs to process it.
 * Robots.txt: A file used by website operators to define which parts of their website may be crawled by bots (such as web scrapers).
+* RSS (Really Simple Syndication) is an XML-based format standard used to publish frequently updated content such as blog posts, news, or podcasts. Despite the rise of other technologies such as APIs for content syndication, RSS is still widely used, especially in niches for ongoing, timely information.
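Note on the Request Parameter entry: the `?id=123&name=test` example can be taken apart and rebuilt with Python's standard library. This is only a minimal sketch; the host and path in `https://example.com/items` are made-up placeholders and not part of the glossary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical URL; only the query string comes from the glossary entry.
url = "https://example.com/items?id=123&name=test"

# Everything after the "?" is the query string.
query = urlsplit(url).query

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(query)

# urlencode goes the other way: from a mapping back to a query string.
rebuilt = urlencode({"id": 123, "name": "test"})

print(query)
print(params)
print(rebuilt)
```

Parsing into a mapping rather than splitting on `&` by hand also takes care of percent-encoded characters in parameter values.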
 ## S as in Selenium
 * Scraper: A script or program that extracts data from websites.
 * Selenium: A tool used mainly for testing web applications, but also for web scraping, because its browser automation can handle JavaScript-heavy pages.
 * Spider: A specific term for a program or module that navigates through web pages and collects data, often used in connection with frameworks such as Scrapy.
 * Splash: A JavaScript rendering service that is often used with Scrapy to render dynamic web pages.
+* Status codes are numeric codes returned in HTTP responses to communicate the status of the request.
+    * 1xx (Informational): The request is being processed, e.g. 100 Continue.
+    * 2xx (Success): The request succeeded, e.g. 200 OK, 201 Created.
+    * 3xx (Redirection): Further action is required, e.g. 301 Moved Permanently, 302 Found.
+    * 4xx (Client Error): The request cannot be processed because it is faulty, e.g. 400 Bad Request, 404 Not Found.
+    * 5xx (Server Error): The server was unable to fulfill the request, e.g. 500 Internal Server Error, 503 Service Unavailable.
+* Stream refers to the continuous flow of data, often in real time, as with live video or data analysis, and enables information to be processed and transmitted immediately. Streaming technologies use protocols and platforms to route data efficiently from sources to consumers, whether for multimedia or data processing.
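Note on the Status Codes entry: the five classes are determined by the first digit of the code, which a scraper can use to decide how to react to a response. A minimal sketch, assuming a hypothetical helper named `status_class` that is not part of the glossary or of any library:

```python
def status_class(code: int) -> str:
    """Return the class name for an HTTP status code (100-599)."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # The first digit determines the class, so integer-divide by 100.
    return classes.get(code // 100, "Unknown")


for code in (100, 200, 301, 404, 503):
    print(code, status_class(code))
```

In a scraper, a 4xx result usually means the request itself should be corrected, while a 5xx result may be worth retrying later.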
 ## T as in Throttling
 * Text Mining: The process of extracting useful knowledge from text, often after scraping text content.
 ## Y as in Yield
 * Yield: A Python keyword used in generator functions to return a result and pause execution until the next value is requested; useful in web scraping for working efficiently with large amounts of data.
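Note on the Yield entry: a generator pairs naturally with the `?page=2` URL pattern described under Pagination, because page URLs are produced one at a time instead of being built up front. A minimal sketch, assuming a hypothetical function `page_urls` and a made-up base URL; nothing is downloaded here.

```python
from typing import Iterator
from urllib.parse import urlencode


def page_urls(base_url: str, last_page: int) -> Iterator[str]:
    """Yield one URL per page, from page 1 up to last_page."""
    for page in range(1, last_page + 1):
        query = urlencode({"page": page})
        # Execution pauses here until the caller asks for the next value.
        yield f"{base_url}?{query}"


# Hypothetical base URL; the URLs are only built, not requested.
for url in page_urls("https://example.com/articles", 3):
    print(url)
```

Because the generator is lazy, a caller can stop iterating early (for example when a page comes back empty) without having computed the remaining URLs.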
+## Z as in Zyte
 * Zope: A Python web framework that is sometimes mentioned in connection with web scraping because it offers tools and libraries that can also be useful in scraping projects.
+* Zyte (formerly Scrapinghub) is a leading company in the field of web data extraction and offers a variety of tools and services that simplify and scale web scraping.

src/_extensions/shafayetShafee/downloadthis/_extension.yml DELETED

@@ -1,8 +0,0 @@
-title: Downloadthis
-author: Shafayet Khan Shafee
-version: 1.1.0
-quarto-required: ">=1.2.0"
-contributes:
-  shortcodes:
-    - downloadthis.lua

src/_extensions/shafayetShafee/downloadthis/downloadthis.lua DELETED

@@ -1,121 +0,0 @@
---[[
-MIT License
-Copyright (c) 2023 Shafayet Khan Shafee
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
-]]--
-local str = pandoc.utils.stringify
---local p = quarto.log.output
-local function ensureHtmlDeps()
-  quarto.doc.add_html_dependency({
-    name = "downloadthis",
-    version = "1.9.1",
-    stylesheets = {"resources/css/downloadthis.css"}
-  })
-end
-local function optional(arg, default)
-  if arg == nil or arg == ""
-  then
-    return default
-  else
-    return arg
-  end
-end
-function import(script)
-  local path = PANDOC_SCRIPT_FILE:match("(.*[/\\])")
-  package.path = path .. script .. ";" .. package.path
-  return require(script)
-end
-local puremagic = import("puremagic.lua")
-return {
-  ['downloadthis'] = function(args, kwargs, meta)
-    -- args and kwargs
-    local file_path = str(args[1])
-    local extension = "." .. file_path:match("[^.]+$")
-    local dname = optional(str(kwargs["dname"]), "file")
-    local dfilename = dname .. extension
-    local btn_label = " " .. optional(str(kwargs["label"]), "Download") .. " "
-    local btn_type = optional(str(kwargs["type"]), "default")
-    local icon = optional(str(kwargs["icon"]), "download")
-    local class = " " .. optional(str(kwargs["class"]), "")
-    local rand = "dnldts" .. str(math.random(1, 65000))
-    local id = optional(str(kwargs["id"]), rand)
-    -- reading files
-    local fh = io.open(file_path, "rb")
-    if not fh then
-        io.stderr:write("Cannot open file " ..
-          file_path ..
-          " | Skipping adding buttons\n")
-        return pandoc.Null()
-    else
-      local contents = fh:read("*all")
-      fh:close()
-      -- creating dataURI object
-      local b64_encoded = quarto.base64.encode(contents)
-      local mimetype = puremagic.via_path(file_path)
-      local data_uri = 'data:' .. mimetype .. ";base64," .. b64_encoded
-      -- js code taken from
-      -- https://github.com/fmmattioni/downloadthis/blob/master/R/utils.R#L59
-      local js = [[fetch('%s').then(res => res.blob()).then(blob => {
-      const downloadURL = window.URL.createObjectURL(blob);
-      const a = document.createElement('a');
-      document.body.appendChild(a);
-      a.href = downloadURL;
-      a.download = '%s'; a.click();
-      window.URL.revokeObjectURL(downloadURL);
-      document.body.removeChild(a);
-        });]]
-      local clicked = js:format(data_uri, dfilename)
-      -- creating button
-      local button =
-          "<button class=\"btn btn-" .. btn_type .. " downloadthis " ..
-          class .. "\"" ..
-          " id=\"" .. id .. "\"" ..
-          "><i class=\"bi bi-" .. icon .. "\"" .. "></i>" ..
-          btn_label ..
-          "</button>"
-      if quarto.doc.is_format("html:js") and quarto.doc.has_bootstrap()
-      then
-        ensureHtmlDeps()
-        return pandoc.RawInline('html',
-        "<a href=\"#" .. id .. "\"" ..
-        " onclick=\"" .. clicked .. "\">" .. button .. "</a>"
-        )
-      else
-        return pandoc.Null()
-      end
-    end
-  end
-}

src/_extensions/shafayetShafee/downloadthis/puremagic.lua DELETED

@@ -1,735 +0,0 @@
--- puremagic 1.0.1
--- Copyright (c) 2014 Will Bond <[email protected]>
--- Licensed under the MIT license.
-function basename(path)
-    local basename_match = path:match('[/\\]([^/\\]+)$')
-    if basename_match then
-        return basename_match, nil
-    end
-    return path, nil
-end
-function extension(path)
-    path = path:lower()
-    local tar_match = path:match('%.(tar%.[^.]+)$')
-    if tar_match then
-        return tar_match
-    end
-    if path:sub(#path - 11, #path) == '.numbers.zip' then
-        return 'numbers.zip'
-    end
-    if path:sub(#path - 9, #path) == '.pages.zip' then
-        return 'pages.zip'
-    end
-    if path:sub(#path - 7, #path) == '.key.zip' then
-        return 'key.zip'
-    end
-    return path:match('%.([^.]+)$')
-end
-function in_table(value, list)
-    for i=1, #list do
-        if list[i] == value then
-            return true
-        end
-    end
-    return false
-end
-function string_to_bit_table(chars)
-    local output = {}
-    for char in chars:gmatch('.') do
-        local num = string.byte(char)
-        local bits = {0, 0, 0, 0, 0, 0, 0, 0}
-        for bit=8, 1, -1 do
-            if num > 0 then
-                bits[bit] = math.fmod(num, 2)
-                num = (num - bits[bit]) / 2
-            end
-        end
-        table.insert(output, bits)
-    end
-    return output
-end
-function bit_table_to_string(bits)
-    local output = {}
-    for i = 1, #bits do
-        local num = tonumber(table.concat(bits[i]), 2)
-        table.insert(output, string.format('%c', num))
-    end
-    return table.concat(output)
-end
-function bitwise_and(a, b)
-    local a_bytes = string_to_bit_table(a)
-    local b_bytes = string_to_bit_table(b)
-    local output = {}
-    for i = 1, #a_bytes do
-        local bits = {0, 0, 0, 0, 0, 0, 0, 0}
-        for j = 1, 8 do
-            if a_bytes[i][j] == 1 and b_bytes[i][j] == 1 then
-                bits[j] = 1
-            else
-                bits[j] = 0
-            end
-        end
-        table.insert(output, bits)
-    end
-    return bit_table_to_string(output)
-end
--- Unpack a little endian byte string into an integer
-function unpack_le(chars)
-    local bit_table = string_to_bit_table(chars)
-    -- Merge the bits into a string of 1s and 0s
-    local result = {}
-    for i=1, #bit_table do
-        result[#chars + 1 - i] = table.concat(bit_table[i])
-    end
-    return tonumber(table.concat(result), 2)
-end
--- Unpack a big endian byte string into an integer
-function unpack_be(chars)
-    local bit_table = string_to_bit_table(chars)
-    -- Merge the bits into a string of 1s and 0s
-    for i=1, #bit_table do
-        bit_table[i] = table.concat(bit_table[i])
-    end
-    return tonumber(table.concat(bit_table), 2)
-end
--- Takes the first 4-8k of an EBML file and identifies if it is matroska or webm
--- and if it contains just video or just audio.
-function ebml_parse(content)
-    local position = 1
-    local length = #content
-    local header_token, header_value, used_bytes = ebml_parse_section(content)
-    position = position + used_bytes
-    if header_token ~= '\x1AE\xDF\xA3' then
-        return nil, 'Unable to find EBML ID'
-    end
-    -- The matroska spec sets the default doctype to be 'matroska', however
-    -- many files specify this anyway. The other option is 'webm'.
-    local doctype = 'matroska'
-    if header_value['B\x82'] then
-        doctype = header_value['B\x82']
-    end
-    if doctype ~= 'matroska' and doctype ~= 'webm' then
-        return nil, 'Unknown EBML doctype'
-    end
-    local segment_position = nil
-    local track_position = nil
-    local has_video = false
-    local found_tracks = false
-    while position <= length do
-        local ebml_id, ebml_value, used_bytes = ebml_parse_section(content:sub(position, length))
-        position = position + used_bytes
-        -- Segment
-        if ebml_id == '\x18S\x80g' then
-            segment_position = position
-        end
-        -- Meta seek information
-        if ebml_id == '\x11M\x9Bt' then
-            -- Look for the seek info about the tracks token
-            for i, child in ipairs(ebml_value['M\xBB']) do
-                if child['S\xAB'] == '\x16T\xAEk' then
-                    track_position = segment_position + unpack_be(child['S\xAC'])
-                    position = track_position
-                    break
-                end
-            end
-        end
-        -- Track
-        if ebml_id == '\x16T\xAEk' then
-            found_tracks = true
-            -- Scan through each track looking for video
-            for i, child in ipairs(ebml_value['\xAE']) do
-                -- Look to see if the track type is video
-                if unpack_be(child['\x83']) == 1 then
-                    has_video = true
-                    break
-                end
-            end
-            break
-        end
-    end
-    if found_tracks and not has_video then
-        if doctype == 'matroska' then
-            return 'audio/x-matroska'
-        else
-            return 'audio/webm'
-        end
-    end
-    if doctype == 'matroska' then
-        return 'video/x-matroska'
-    else
-        return 'video/webm'
-    end
-end
--- Parses a section of an EBML document, returning the EBML ID at the beginning,
--- plus the value as a table with child EBML IDs as keys and the number of
--- bytes from the content that contained the ID and value
-function ebml_parse_section(content)
-    local ebml_id, element_length, used_bytes = ebml_id_and_length(content)
-    -- Don't parse the segment since it is the whole file!
-    if ebml_id == '\x18\x53\x80\x67' then
-        return ebml_id, nil, used_bytes
-    end
-    local ebml_value = content:sub(used_bytes + 1, used_bytes + element_length)
-    used_bytes = used_bytes + element_length
-    -- We always parse the return value of level 0/1 elements
-    local recursive_parse = false
-    if #ebml_id == 4 then
-        recursive_parse = true
-    -- We need Seek information
-    elseif ebml_id == '\x4D\xBB' then
-        recursive_parse = true
-    -- We want the top-level of TrackEntry to grab the TrackType
-    elseif ebml_id == '\xAE' then
-        recursive_parse = true
-    end
-    if recursive_parse then
-        local buffer = ebml_value
-        ebml_value = {}
-        -- Track which child entries have been converted to an array
-        local array_children = {}
-        while #buffer > 0 do
-            local child_ebml_id, child_ebml_value, child_used_bytes = ebml_parse_section(buffer)
-            if array_children[child_ebml_id] then
-                table.insert(ebml_value[child_ebml_id], child_ebml_value)
-            -- Single values are just stored by themselves
-            elseif ebml_value[child_ebml_id] == nil then
-                -- Force seek info and tracks to be arrays even if there is only one
-                if child_ebml_id == 'M\xBB' or child_ebml_id == '\xAE' then
-                    child_ebml_value = {child_ebml_value}
-                    array_children[child_ebml_id] = true
-                end
-                ebml_value[child_ebml_id] = child_ebml_value
-            -- If there is already a value for the ID, turn it into a table
-            else
-                ebml_value[child_ebml_id] = {ebml_value[child_ebml_id], child_ebml_value}
-                array_children[child_ebml_id] = true
-            end
-            -- Move past the part we've parsed
-            buffer = buffer:sub(child_used_bytes + 1, #buffer)
-        end
-    end
-    return ebml_id, ebml_value, used_bytes
-end
--- Should accept 12+ bytes, will return the ebml id, the data length and the
--- number of bytes that were used to hold those values.
-function ebml_id_and_length(chars)
-    -- The ID is encoded the same way as the length, however, we don't want
-    -- to remove the length bits from the ID value or interpret it as an
-    -- unsigned int since all of the documentation online references the IDs in
-    -- encoded form.
-    local _, id_length = ebml_length(chars:sub(1, 4))
-    local ebml_id = chars:sub(1, id_length)
-    local remaining = chars:sub(id_length + 1, id_length + 8)
-    local element_length, used_bytes = ebml_length(remaining)
-    return ebml_id, element_length, id_length + used_bytes
-end
--- Should accept 8+ bytes, will return the data length plus the number of bytes
--- that were used to hold the data length.
-function ebml_length(chars)
-    -- We substring chars to ensure we don't build a huge table we don't need
-    local bit_tables = string_to_bit_table(chars:sub(1, 8))
-    local value_length = 1
-    for i=1, #bit_tables[1] do
-        if bit_tables[1][i] == 0 then
-            value_length = value_length + 1
-        else
-            -- Clear the indicator bit so the rest of the byte
-            bit_tables[1][i] = 0
-            break
-        end
-    end
-    local bits = {}
-    for i=1, value_length do
-        table.insert(bits, table.concat(bit_tables[i]))
-    end
-    return tonumber(table.concat(bits), 2), value_length
-end
-function binary_tests(content, ext)
-    local length = #content
-    local _1_8   = content:sub(1, 8)
-    local _1_7   = content:sub(1, 7)
-    local _1_6   = content:sub(1, 6)
-    local _1_5   = content:sub(1, 5)
-    local _1_4   = content:sub(1, 4)
-    local _1_3   = content:sub(1, 3)
-    local _1_2   = content:sub(1, 2)
-    local _9_12  = content:sub(9, 12)
-    -- Images
-    if _1_4 == '\xC5\xD0\xD3\xC6' then
-        -- With a Windows-format EPS, the file starts right after a 30-byte
-        -- header, or a 30-byte header followed by two bytes of padding
-        if content:sub(33, 42) == '%!PS-Adobe' or content:sub(31, 40) == '%!PS-Adobe' then
-            return 'application/postscript'
-        end
-    end
-    if _1_8 == '%!PS-Ado' and content:sub(9, 10) == 'be' then
-        return 'application/postscript'
-    end
-    if _1_4 == 'MM\x00*' or _1_4 == 'II*\x00' then
-        return 'image/tiff'
-    end
-    if _1_8 == '\x89PNG\r\n\x1A\n' then
-        return 'image/png'
-    end
-    if _1_6 == 'GIF87a' or _1_6 == 'GIF89a' then
-        return 'image/gif'
-    end
-    if _1_4 == 'RIFF' and _9_12 == 'WEBP' then
-        return 'image/webp'
-    end
-    if _1_2 == 'BM' and length > 14 and in_table(content:sub(15, 15), {'\x0C', '(', '@', '\x80'}) then
-        return 'image/x-ms-bmp'
-    end
-    local normal_jpeg    = length > 10 and in_table(content:sub(7, 10), {'JFIF', 'Exif'})
-    local photoshop_jpeg = length > 24 and _1_4 == '\xFF\xD8\xFF\xED' and content:sub(21, 24) == '8BIM'
-    if normal_jpeg or photoshop_jpeg then
-        return 'image/jpeg'
-    end
-    if _1_4 == '8BPS' then
-        return 'image/vnd.adobe.photoshop'
-    end
-    if _1_8 == '\x00\x00\x00\x0CjP  ' and _9_12 == '\r\n\x87\n' then
-        return 'image/jp2'
-    end
-    if _1_4 == '\x00\x00\x01\x00' then
-        return 'application/vnd.microsoft.icon'
-    end
-    -- Audio/Video
-    if _1_4 == '\x1AE\xDF\xA3' and length > 1000 then
-        local mimetype, err = ebml_parse(content)
-        if mimetype then
-            return mimetype
-        end
-    end
-    if _1_4 == 'MOVI' then
-        if in_table(content:sub(5, 8), {'moov', 'mdat'}) then
-            return 'video/quicktime'
-        end
-    end
-    if length > 8 and content:sub(5, 8) == 'ftyp' then
-        local lower_9_12 = _9_12:lower()
-        if in_table(lower_9_12, {'avc1', 'isom', 'iso2', 'mp41', 'mp42', 'mmp4', 'ndsc', 'ndsh', 'ndsm', 'ndsp', 'ndss', 'ndxc', 'ndxh', 'ndxm', 'ndxp', 'ndxs', 'f4v ', 'f4p ', 'm4v '}) then
-            return 'video/mp4'
-        end
-        if in_table(lower_9_12, {'msnv', 'ndas', 'f4a ', 'f4b ', 'm4a ', 'm4b ', 'm4p '}) then
-            return 'audio/mp4'
-        end
-        if in_table(lower_9_12, {'3g2a', '3g2b', '3g2c', 'kddi'}) then
-            return 'video/3gpp2'
-        end
-        if in_table(lower_9_12, {'3ge6', '3ge7', '3gg6', '3gp1', '3gp2', '3gp3', '3gp4', '3gp5', '3gp6', '3gs7'}) then
-            return 'video/3gpp'
-        end
-        if lower_9_12 == 'mqt ' or lower_9_12 == 'qt  ' then
-            return 'video/quicktime'
-        end
-        if lower_9_12 == 'jp2 ' then
-            return 'image/jp2'
-        end
-    end
-    -- MP3
-    if bitwise_and(_1_2, '\xFF\xF6') == '\xFF\xF2' then
-        local byte_3 = content:sub(3, 3)
-        if bitwise_and(byte_3, '\xF0') ~= '\xF0' and bitwise_and(byte_3, "\x0C") ~= "\x0C" then
-            return 'audio/mpeg'
-        end
-    end
-    if _1_3 == 'ID3' then
-        return 'audio/mpeg'
-    end
-    if _1_4 == 'fLaC' then
-        return 'audio/x-flac'
-    end
-    if _1_8 == '0&\xB2u\x8Ef\xCF\x11' then
-        -- Without writing a full-on ASF parser, we can just scan for the
-        -- UTF-16 string "AspectRatio"
-        if content:find('\x00A\x00s\x00p\x00e\x00c\x00t\x00R\x00a\x00t\x00i\x00o', 1, true) then
-            return 'video/x-ms-wmv'
-        end
-        return 'audio/x-ms-wma'
-    end
-    if _1_4 == 'RIFF' and _9_12 == 'AVI ' then
-        return 'video/x-msvideo'
-    end
-    if _1_4 == 'RIFF' and _9_12 == 'WAVE' then
-        return 'audio/x-wav'
-    end
-    if _1_4 == 'FORM' and _9_12 == 'AIFF' then
-        return 'audio/x-aiff'
-    end
-    if _1_4 == 'OggS' then
-        local _29_33 = content:sub(29, 33)
-        if _29_33 == '\x01vorb' then
-            return 'audio/vorbis'
-        end
-        if _29_33 == '\x07FLAC' then
-            return 'audio/x-flac'
-        end
-        if _29_33 == 'OpusH' then
-            return 'audio/ogg'
-        end
-        -- Theora and OGM
-        if _29_33 == '\x80theo' or _29_33 == 'vide' then
-            return 'video/ogg'
-        end
-    end
-    if _1_3 == 'FWS' or _1_3 == 'CWS' then
-        return 'application/x-shockwave-flash'
-    end
-    if _1_3 == 'FLV' then
-        return 'video/x-flv'
-    end
-    if _1_5 == '%PDF-' then
-        return 'application/pdf'
-    end
-    if _1_5 == '{\\rtf' then
-        return 'text/rtf'
-    end
-    -- Office '97-2003 formats
-    if _1_8 == '\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1' then
-        if in_table(ext, {'xls', 'csv', 'tab'}) then
-            return 'application/vnd.ms-excel'
-        end
-        if ext == 'ppt' then
-            return 'application/vnd.ms-powerpoint'
-        end
-        -- We default to word since we need something if the extension isn't recognized
-        return 'application/msword'
-    end
-    if _1_8 == '\x09\x04\x06\x00\x00\x00\x10\x00' then
-        return 'application/vnd.ms-excel'
-    end
-    if _1_6 == '\xDB\xA5\x2D\x00\x00\x00' or _1_5 == '\x50\x4F\x5E\x51\x60' or _1_4 == '\xFE\x37\x00\x23' or _1_3 == '\x94\xA6\x2E' then
-        return 'application/msword'
-    end
-    if _1_4 == 'PK\x03\x04' then
-        -- Office XML formats
-        if ext == 'xlsx' then
-            return 'application/vnd.ms-excel'
-        end
-        if ext == 'pptx' then
-            return 'application/vnd.ms-powerpoint'
-        end
-        if ext == 'docx' then
-            return 'application/msword'
-        end
-        -- Open Office formats
-        if ext == 'ods' then
-            return 'application/vnd.oasis.opendocument.spreadsheet'
-        end
-        if ext == 'odp' then
-            return 'application/vnd.oasis.opendocument.presentation'
-        end
-        if ext == 'odt' then
-            return 'application/vnd.oasis.opendocument.text'
-        end
-        -- iWork - some programs like Mac Mail change the filename to
-        -- .numbers.zip, etc
-        if ext == 'pages' or ext == 'pages.zip' then
-          return 'application/vnd.apple.pages'
-        end
-        if ext == 'key' or ext == 'key.zip' then
-            return 'application/vnd.apple.keynote'
-        end
-        if ext == 'numbers' or ext == 'numbers.zip' then
-            return 'application/vnd.apple.numbers'
-        end
-        -- Otherwise just a zip
-        return 'application/zip'
-    end
-    -- Archives
-    if length > 257 then
-        if content:sub(258, 263) == 'ustar\x00' then
-            return 'application/x-tar'
-        end
-        if content:sub(258, 265) == 'ustar\x40\x40\x00' then
-            return 'application/x-tar'
-        end
-    end
-    if _1_7 == 'Rar!\x1A\x07\x00' or _1_8 == 'Rar!\x1A\x07\x01\x00' then
-        return 'application/x-rar-compressed'
-    end
-    if _1_2 == '\x1F\x9D' then
-        return 'application/x-compress'
-    end
-    if _1_2 == '\x1F\x8B' then
-        return 'application/x-gzip'
-    end
-    if _1_3 == 'BZh' then
-        return 'application/x-bzip2'
-    end
-    if _1_6 == '\xFD7zXZ\x00' then
-        return 'application/x-xz'
-    end
-    if _1_6 == '7z\xBC\xAF\x27\x1C' then
-        return 'application/x-7z-compressed'
-    end
-    if _1_2 == 'MZ' then
-        local pe_header_start = unpack_le(content:sub(61, 64))
-        local signature = content:sub(pe_header_start + 1, pe_header_start + 4)
-        if signature == 'PE\x00\x00' then
-            local image_file_header_start = pe_header_start + 5
-            local characteristics = content:sub(image_file_header_start + 18, image_file_header_start + 19)
-            local is_dll = bitwise_and(characteristics, '\x20\x00') == '\x20\x00'
-            if is_dll then
-                return 'application/x-msdownload'
-            end
-            return 'application/octet-stream'
-        end
-    end
-    return nil
-end
-function text_tests(content)
-    local lower_content = content:lower()
-    if content:find('^%%!PS-Adobe') then
-        return 'application/postscript'
-    end
-    if lower_content:find('<?php', 1, true) or content:find('<?=', 1, true) then
-        return 'application/x-httpd-php'
-    end
-    if lower_content:find('^%s*<%?xml') then
-        if content:find('<svg') then
-            return 'image/svg+xml'
-        end
-        if lower_content:find('<!doctype html') then
-            return 'application/xhtml+xml'
-        end
-        if content:find('<rss') then
-            return 'application/rss+xml'
-        end
-        return 'application/xml'
-    end
-    if lower_content:find('^%s*<html') or lower_content:find('^%s*<!doctype') then
-        return 'text/html'
-    end
-    if lower_content:find('^#![/a-z0-9]+ ?python') then
-        return 'application/x-python'
-    end
-    if lower_content:find('^#![/a-z0-9]+ ?perl') then
-        return 'application/x-perl'
-    end
-    if lower_content:find('^#![/a-z0-9]+ ?ruby') then
-        return 'application/x-ruby'
-    end
-    if lower_content:find('^#![/a-z0-9]+ ?php') then
-        return 'application/x-httpd-php'
-    end
-    if lower_content:find('^#![/a-z0-9]+ ?bash') then
-        return 'text/x-shellscript'
-    end
-    return nil
-end
-local ext_map = {
-    css   = 'text/css',
-    csv   = 'text/csv',
-    htm   = 'text/html',
-    html  = 'text/html',
-    xhtml = 'text/html',
-    ics   = 'text/calendar',
-    js    = 'application/javascript',
-    php   = 'application/x-httpd-php',
-    php3  = 'application/x-httpd-php',
-    php4  = 'application/x-httpd-php',
-    php5  = 'application/x-httpd-php',
-    inc   = 'application/x-httpd-php',
-    pl    = 'application/x-perl',
-    cgi   = 'application/x-perl',
-    py    = 'application/x-python',
-    rb    = 'application/x-ruby',
-    rhtml = 'application/x-ruby',
-    rss   = 'application/rss+xml',
-    sh    = 'text/x-shellscript',
-    tab   = 'text/tab-separated-values',
-    vcf   = 'text/x-vcard',
-    xml   = 'application/xml'
-}
-function ext_tests(ext)
-    local mimetype = ext_map[ext]
-    if mimetype then
-        return mimetype
-    end
-    return 'text/plain'
-end
-local _M = {}
-function _M.via_path(path, filename)
-    local f, err = io.open(path, 'r')
-    if not f then
-        return nil, err
-    end
-    local content = f:read(4096)
-    f:close()
-    if not filename then
-        filename = basename(path)
-    end
-    return _M.via_content(content, filename)
-end
-function _M.via_content(content, filename)
-    local ext = extension(filename)
-    -- If there are no low ASCII chars and no easily distinguishable tokens,
-    -- we need to detect by file extension
-    local mimetype = nil
-    mimetype = binary_tests(content, ext)
-    if mimetype then
-        return mimetype
-    end
-    -- Binary-looking files should have been detected so far
-    if content:find('[%z\x01-\x08\x0B\x0C\x0E-\x1F]') then
-        return 'application/octet-stream'
-    end
-    mimetype = text_tests(content)
-    if mimetype then
-        return mimetype
-    end
-    return ext_tests(ext)
-end
-return _M

src/_extensions/shafayetShafee/downloadthis/resources/css/downloadthis.css DELETED

@@ -1,13 +0,0 @@
-.downloadthis:focus,
-  .downloadthis:active  {
-     box-shadow: none !important;
-  }
-  .downloadthis:hover {
-     transition: 0.2s;
-     filter: brightness(0.90);
-  }
-  .downloadthis:active {
-     filter: brightness(0.80);
-}

src/_quarto.yml CHANGED

@@ -47,7 +47,7 @@ website:
           - href: 01_setup/glossar.qmd
             text: "Glossar"
           - href: 01_setup/ressourcen.qmd
-            text: "Ressourcen"
+            text: "Weitere Ressourcen"
     - title: "No Code"
       contents:
         - href: basics.qmd
@@ -112,4 +112,4 @@ format:
   html:
     theme: cosmo
     css: styles.css
-    toc: true
+    toc: true

src/index.qmd CHANGED

@@ -6,7 +6,7 @@ Herzlich willkommen zum Webscraping Workshop! Egal, ob Erste-Schritte oder Fortg
 * Hast du ein Google Nutzerkonto?
 * Hast du ein Huggingface Nutzerkonto?
 * Hast du schon einmal Daten aus dem Internet extrahiert?
-* Hast du schonmal Daten über eine API bezogen?
+* Hast du schonmal Daten über eine [API](01_setup/glossar.html#a-wie-asynchrones-scraping) bezogen?
 * Nutzt du Große Sprachmodelle?
 ## Navigation auf der Workshop Webseite 🧭