spark-nlp/keyword-extraction

abdullahmubeen10 committed on Jul 20, 2024

Commit 1888eb1 · verified · 1 Parent(s): efd21b4

Update pages/Workflow & Model Overview.py

Files changed (1)

pages/Workflow & Model Overview.py +337 -235

pages/Workflow & Model Overview.py CHANGED

@@ -1,235 +1,337 @@
-import streamlit as st
-import pandas as pd
-# Custom CSS for better styling
-st.markdown("""
-    <style>
-        .main-title {
-            font-size: 36px;
-            color: #4A90E2;
-            font-weight: bold;
-            text-align: center;
-        }
-        .sub-title {
-            font-size: 24px;
-            color: #4A90E2;
-            margin-top: 20px;
-        }
-        .section {
-            background-color: #f9f9f9;
-            padding: 15px;
-            border-radius: 10px;
-            margin-top: 20px;
-        }
-        .section h2 {
-            font-size: 22px;
-            color: #4A90E2;
-        }
-        .section p, .section ul {
-            color: #666666;
-        }
-        .link {
-            color: #4A90E2;
-            text-decoration: none;
-        }
-    </style>
-""", unsafe_allow_html=True)
-# Introduction
-st.markdown('<div class="main-title">Keyword Extraction from Texts with Python and Spark NLP</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <p>Welcome to the Spark NLP Keyword Extraction Demo App! Keyword extraction is a technique in natural language processing (NLP) that involves automatically identifying the most important words or phrases in a document or corpus. Keywords extracted from a text can be used in a variety of ways, including:</p>
-    <ul>
-        <li>Document indexing</li>
-        <li>Document summarization</li>
-        <li>Content categorization</li>
-        <li>Content tagging</li>
-        <li>Search engine optimization</li>
-    </ul>
-    <p>This app demonstrates how to use Spark NLP's YakeKeywordExtraction annotator to perform keyword extraction using Python.</p>
-</div>
-""", unsafe_allow_html=True)
-# About Keyword Extraction
-st.markdown('<div class="sub-title">About Keyword Extraction</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <p>Extracting keywords from texts has become difficult for individuals and organizations as the complexity and volume of information have grown. The need to automate this task so that text can be processed promptly and adequately has led to the emergence of automatic keyword extraction tools. NLP and Python libraries help in the process.</p>
-</div>
-""", unsafe_allow_html=True)
-# Using YakeKeywordExtraction in Spark NLP
-st.markdown('<div class="sub-title">Using YakeKeywordExtraction in Spark NLP</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <p>Yake! is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains, or languages. Unlike other approaches, Yake! does not rely on dictionaries or thesauri, nor is it trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text.</p>
-</div>
-""", unsafe_allow_html=True)
-st.markdown('<h2 class="sub-title">Example Usage in Python</h2>', unsafe_allow_html=True)
-st.markdown('<p>Here’s how you can implement keyword extraction using the YakeKeywordExtraction annotator in Spark NLP:</p>', unsafe_allow_html=True)
-# Setup Instructions
-st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
-st.markdown('<p>To install Spark NLP and extract keywords in Python, simply use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
-st.code("""
-pip install spark-nlp
-pip install pyspark
-""", language="bash")
-st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
-st.code("""
-import sparknlp
-# Start Spark Session
-spark = sparknlp.start()
-""", language='python')
-# Keyword Extraction Example
-st.markdown('<div class="sub-title">Example Usage: Keyword Extraction with YakeKeywordExtraction</div>', unsafe_allow_html=True)
-st.code('''
-from sparknlp.base import DocumentAssembler, Pipeline
-from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
-import pyspark.sql.functions as F
-# Step 1: Transforms raw texts to document annotation
-document = DocumentAssembler() \\
-    .setInputCol("text") \\
-    .setOutputCol("document")
-# Step 2: Sentence Detection
-sentenceDetector = SentenceDetector() \\
-    .setInputCols(["document"]) \\
-    .setOutputCol("sentence")
-# Step 3: Tokenization
-token = Tokenizer() \\
-    .setInputCols(["sentence"]) \\
-    .setOutputCol("token") \\
-    .setContextChars(["(", ")", "?", "!", ".", ","])
-# Step 4: Keyword Extraction
-keywords = YakeKeywordExtraction() \\
-    .setInputCols(["token"]) \\
-    .setOutputCol("keywords")
-# Define the pipeline
-yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])
-# Create an empty dataframe
-empty_df = spark.createDataFrame([['']]).toDF("text")
-# Fit the dataframe to get the model
-yake_model = yake_pipeline.fit(empty_df)
-# Using LightPipeline
-from sparknlp.base import LightPipeline
-light_model = LightPipeline(yake_model)
-text = """
-google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
-"""
-light_result = light_model.fullAnnotate(text)[0]
-import pandas as pd
-keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'], k.metadata['sentence']) for k in light_result['keywords']],
-                       columns=['keywords', 'begin', 'end', 'score', 'sentence'])
-keys_df['score'] = keys_df['score'].astype(float)
-# ordered by relevance
-keys_df.sort_values(['sentence', 'score']).head(100)
-''', language='python')
-data = {
-    "Keyword": ["data science", "acquiring data", "google is acquiring", "community kaggle", "science community", "acquiring data science", "data science", "machine learning", "learning competitions", "acquiring kaggle", "google is acquiring", "hosts data", "science and machine", "google cloud", "cloud platform", "google cloud platform", "index ventures", "khosla ventures", "yuri milner", "sv angel", "max levchin", "naval ravikant", "hal varian", "cloud next", "next conference", "cloud next conference", "goldbloom declined", "anthony goldbloom", "data scientists", "ben hamner", "million data", "data science", "machine learning", "learning competitions", "running data", "science and machine", "data scientists", "machine learning"],
-    "Begin": [21, 11, 1, 34, 26, 11, 123, 140, 148, 83, 73, 117, 128, 1450, 1457, 1450, 2197, 2287, 2307, 2213, 2223, 2236, 2275, 262, 268, 262, 419, 411, 567, 629, 559, 895, 912, 920, 887, 900, 1024, 1333],
-    "End": [32, 24, 19, 49, 42, 32, 134, 155, 168, 98, 91, 126, 146, 1461, 1470, 1470, 2210, 2301, 2317, 2220, 2233, 2249, 2284, 271, 282, 282, 436, 427, 581, 638, 570, 906, 927, 940, 898, 918, 1038, 1348],
-    "Score": [0.255856, 0.844244, 1.039254, 1.040628, 1.152803, 1.263860, 0.255856, 0.466911, 0.762934, 0.849239, 1.039254, 1.203691, 1.257900, 0.611960, 0.796338, 1.070615, 0.904638, 0.904638, 1.008957, 1.045587, 1.045587, 1.045587, 1.045587, 0.514866, 0.994605, 1.242611, 1.078377, 1.323419, 0.562581, 0.623279, 1.210565, 0.255856, 0.466911, 0.762934, 1.183755, 1.257900, 0.562581, 0.466911],
-    "Sentence": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 3, 3, 4, 4, 4, 6, 6, 6, 6, 6, 7, 9]
-}
-df = pd.DataFrame(data)
-st.markdown(
-    """
-    <style>
-    .stTable {
-        margin-left: auto;
-        margin-right: auto;
-    }
-    </style>
-    """,
-    unsafe_allow_html=True
-)
-with st.expander("View Data Table"):
-    st.table(df)
-st.markdown("""
-<p>The code snippet demonstrates how to set up a pipeline in Spark NLP to perform keyword extraction on text data using the YakeKeywordExtraction annotator. The resulting DataFrame contains the keywords and their corresponding scores.</p>
-""", unsafe_allow_html=True)
-# Highlighting Keywords in a Text
-st.markdown('<div class="sub-title">Highlighting Keywords in a Text</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <p>In addition to getting the keywords as a dataframe, it is also possible to highlight the extracted keywords in the text.</p>
-    <p>In this example, a dataset of 7537 texts were used — samples from the PubMed, which is a free resource supporting the search and retrieval of biomedical and life sciences literature.</p>
-</div>
-""", unsafe_allow_html=True)
-st.code("""
-import re
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType
-def highlight_keywords(text, keywords):
-    for keyword in keywords:
-        text = re.sub(fr'\\b{keyword}\\b', f'**{keyword}**', text, flags=re.IGNORECASE)
-    return text
-highlight_udf = udf(highlight_keywords, StringType())
-df_with_highlights = df.withColumn("highlighted_text", highlight_udf("text", "keywords"))
-df_with_highlights.select("highlighted_text").show(truncate=False)
-""", language='python')
-# Conclusion
-st.markdown('<div class="sub-title">Conclusion</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <p>In this demo, we demonstrated how to extract keywords from texts using the YakeKeywordExtraction annotator in Spark NLP. We provided step-by-step instructions on setting up the environment, creating a pipeline, and running the keyword extraction. Additionally, we explored how to highlight extracted keywords in the text.</p>
-</div>
-""", unsafe_allow_html=True)
-# References and Additional Information
-st.markdown('<div class="sub-title">For additional information, please check the following references.</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <ul>
-        <li>Documentation <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#yakekeywordextraction" target="_blank" rel="noopener">YakeKeywordExtraction</a></li>
-        <li>Python keyword extraction: Docs about are <a class="link" href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/keyword_extraction/yake_keyword_extraction/index.html" target="_blank" rel="noopener">here</a></li>
-        <li>Scala Docs: <a class="link" href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/keyword/yake/YakeKeywordExtraction.html">YakeKeywordExtraction</a></li>
-        <li>For extended examples of usage, see the <a class="link" href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/8.Keyword_Extraction_YAKE.ipynb" target="_blank" rel="noopener nofollow">Spark NLP Workshop repository</a>.</li>
-        <li>Reference Paper: <a class="link" href="https://www.sciencedirect.com/science/article/abs/pii/S0020025519308588" target="_blank" rel="noopener nofollow">YAKE! Keyword extraction from single documents using multiple local features</a></li>
-        </ul>
-</div>
-""", unsafe_allow_html=True)
-st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
-st.markdown("""
-<div class="section">
-    <ul>
-        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
-        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
-        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
-        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
-        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
-    </ul>
-</div>
-""", unsafe_allow_html=True)

+import streamlit as st
+import pandas as pd
+# Custom CSS for better styling
+st.markdown("""
+    <style>
+        .main-title {
+            font-size: 36px;
+            color: #4A90E2;
+            font-weight: bold;
+            text-align: center;
+        }
+        .sub-title {
+            font-size: 24px;
+            color: #4A90E2;
+            margin-top: 20px;
+        }
+        .section {
+            background-color: #f9f9f9;
+            padding: 15px;
+            border-radius: 10px;
+            margin-top: 20px;
+        }
+        .section h2 {
+            font-size: 22px;
+            color: #4A90E2;
+        }
+        .section p, .section ul {
+            color: #666666;
+        }
+        .link {
+            color: #4A90E2;
+            text-decoration: none;
+        }
+    </style>
+""", unsafe_allow_html=True)
+# Introduction
+st.markdown('<div class="main-title">Keyword Extraction from Texts with Python and Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>Welcome to the Spark NLP Keyword Extraction Demo App! Keyword extraction is a technique in natural language processing (NLP) that involves automatically identifying the most important words or phrases in a document or corpus. Keywords extracted from a text can be used in a variety of ways, including:</p>
+    <ul>
+        <li>Document indexing</li>
+        <li>Document summarization</li>
+        <li>Content categorization</li>
+        <li>Content tagging</li>
+        <li>Search engine optimization</li>
+    </ul>
+    <p>This app demonstrates how to use Spark NLP's YakeKeywordExtraction annotator to perform keyword extraction using Python.</p>
+</div>
+""", unsafe_allow_html=True)
+# About Keyword Extraction
+st.markdown('<div class="sub-title">About Keyword Extraction</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>Extracting keywords from texts has become difficult for individuals and organizations as the complexity and volume of information have grown. The need to automate this task so that text can be processed promptly and adequately has led to the emergence of automatic keyword extraction tools. NLP and Python libraries help in the process.</p>
+</div>
+""", unsafe_allow_html=True)
+# Using YakeKeywordExtraction in Spark NLP
+st.markdown('<div class="sub-title">Using YakeKeywordExtraction in Spark NLP</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>Yake! is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains, or languages. Unlike other approaches, Yake! does not rely on dictionaries or thesauri, nor is it trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text.</p>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('<h2 class="sub-title">Example Usage in Python</h2>', unsafe_allow_html=True)
+st.markdown('<p>Here’s how you can implement keyword extraction using the YakeKeywordExtraction annotator in Spark NLP:</p>', unsafe_allow_html=True)
+# Setup Instructions
+st.markdown('<div class="sub-title">Setup</div>', unsafe_allow_html=True)
+st.markdown('<p>To install Spark NLP and extract keywords in Python, simply use your favorite package manager (conda, pip, etc.). For example:</p>', unsafe_allow_html=True)
+st.code("""
+pip install spark-nlp
+pip install pyspark
+""", language="bash")
+st.markdown("<p>Then, import Spark NLP and start a Spark session:</p>", unsafe_allow_html=True)
+st.code("""
+import sparknlp
+# Start Spark Session
+spark = sparknlp.start()
+""", language='python')
+# Keyword Extraction Example
+st.markdown('<div class="sub-title">Example Usage: Keyword Extraction with YakeKeywordExtraction</div>', unsafe_allow_html=True)
+st.code('''
+from sparknlp.base import DocumentAssembler, Pipeline
+from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
+import pyspark.sql.functions as F
+# Step 1: Transforms raw texts to document annotation
+document = DocumentAssembler() \\
+    .setInputCol("text") \\
+    .setOutputCol("document")
+# Step 2: Sentence Detection
+sentenceDetector = SentenceDetector() \\
+    .setInputCols(["document"]) \\
+    .setOutputCol("sentence")
+# Step 3: Tokenization
+token = Tokenizer() \\
+    .setInputCols(["sentence"]) \\
+    .setOutputCol("token") \\
+    .setContextChars(["(", ")", "?", "!", ".", ","])
+# Step 4: Keyword Extraction
+keywords = YakeKeywordExtraction() \\
+    .setInputCols(["token"]) \\
+    .setOutputCol("keywords")
+# Define the pipeline
+yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])
+# Create an empty dataframe
+empty_df = spark.createDataFrame([['']]).toDF("text")
+# Fit the dataframe to get the model
+yake_model = yake_pipeline.fit(empty_df)
+# Using LightPipeline
+from sparknlp.base import LightPipeline
+light_model = LightPipeline(yake_model)
+text = """
+google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
+"""
+light_result = light_model.fullAnnotate(text)[0]
+import pandas as pd
+keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'], k.metadata['sentence']) for k in light_result['keywords']],
+                       columns=['keywords', 'begin', 'end', 'score', 'sentence'])
+keys_df['score'] = keys_df['score'].astype(float)
+# ordered by relevance
+keys_df.sort_values(['sentence', 'score']).head(100)
+''', language='python')
+data = {
+    "Keyword": ["data science", "acquiring data", "google is acquiring", "community kaggle", "science community", "acquiring data science", "data science", "machine learning", "learning competitions", "acquiring kaggle", "google is acquiring", "hosts data", "science and machine", "google cloud", "cloud platform", "google cloud platform", "index ventures", "khosla ventures", "yuri milner", "sv angel", "max levchin", "naval ravikant", "hal varian", "cloud next", "next conference", "cloud next conference", "goldbloom declined", "anthony goldbloom", "data scientists", "ben hamner", "million data", "data science", "machine learning", "learning competitions", "running data", "science and machine", "data scientists", "machine learning"],
+    "Begin": [21, 11, 1, 34, 26, 11, 123, 140, 148, 83, 73, 117, 128, 1450, 1457, 1450, 2197, 2287, 2307, 2213, 2223, 2236, 2275, 262, 268, 262, 419, 411, 567, 629, 559, 895, 912, 920, 887, 900, 1024, 1333],
+    "End": [32, 24, 19, 49, 42, 32, 134, 155, 168, 98, 91, 126, 146, 1461, 1470, 1470, 2210, 2301, 2317, 2220, 2233, 2249, 2284, 271, 282, 282, 436, 427, 581, 638, 570, 906, 927, 940, 898, 918, 1038, 1348],
+    "Score": [0.255856, 0.844244, 1.039254, 1.040628, 1.152803, 1.263860, 0.255856, 0.466911, 0.762934, 0.849239, 1.039254, 1.203691, 1.257900, 0.611960, 0.796338, 1.070615, 0.904638, 0.904638, 1.008957, 1.045587, 1.045587, 1.045587, 1.045587, 0.514866, 0.994605, 1.242611, 1.078377, 1.323419, 0.562581, 0.623279, 1.210565, 0.255856, 0.466911, 0.762934, 1.183755, 1.257900, 0.562581, 0.466911],
+    "Sentence": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 3, 3, 4, 4, 4, 6, 6, 6, 6, 6, 7, 9]
+}
+df = pd.DataFrame(data)
+st.markdown(
+    """
+    <style>
+    .stTable {
+        margin-left: auto;
+        margin-right: auto;
+    }
+    </style>
+    """,
+    unsafe_allow_html=True
+)
+with st.expander("View Data Table"):
+    st.table(df)
+st.markdown("""
+<p>The code snippet demonstrates how to set up a pipeline in Spark NLP to perform keyword extraction on text data using the YakeKeywordExtraction annotator. The resulting DataFrame contains the keywords and their corresponding scores.</p>
+""", unsafe_allow_html=True)
+# Highlighting Keywords in a Text
+st.markdown('<div class="sub-title">Highlighting Keywords in a Text</div>', unsafe_allow_html=True)
+st.markdown("""
+<p>In addition to getting the keywords as a dataframe, it is also possible to highlight the extracted keywords in the text.</p>
+<p>In this example, a dataset of 7537 texts was used: samples from PubMed, a free resource supporting the search and retrieval of biomedical and life sciences literature.</p>
+""", unsafe_allow_html=True)
+st.code("""
+!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv
+df = spark.read\\
+                .option("header", "true")\\
+                .csv("pubmed_sample_text_small.csv")
+df.show(truncate=False)
+""", language='python')
+st.text('''
++------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
++------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes. METHODS: Vinorelbinewas administered at a dose level of 25 mg/m(2) intravenously on days 1 and 8 of a3 week cycle. Patients were given three or more cycles in the absence of tumorprogression. A maximum of nine cycles were administered. RESULTS: The responserate in 50 evaluable patients was 20.0% (10 out of 50; 95% confidence interval,10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + one MR + 18 NCs out of 50]. TheKaplan-Meier estimate (50% point) of time to progression (TTP) was 115.0 days.The response rate in the visceral organs was 17.3% (nine PRs out of 52). Themajor toxicity was myelosuppression, which was reversible and did not requirediscontinuation of treatment. CONCLUSION: The results of this study show thatvinorelbine monotherapy is useful in patients with advanced or recurrent breastcancer previously exposed to both anthracyclines and taxanes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|OBJECTIVE: To investigate the relationship between preoperative atrialfibrillation and early and late clinical outcomes following cardiac surgery.METHODS: A retrospective cohort including all consecutive coronary artery bypass graft and/or valve surgery patients between 1995 and 2005 was identified (n =9796). No patient had a concomitant surgical AF ablation. The association betweenpreoperative atrial fibrillation and in-hospital outcomes was examined. We alsodetermined late death and cardiovascular-related re-hospitalization by linking toadministrative health databases. Median follow-up was 2.9 years (maximum 11years). RESULTS: The prevalence of preoperative atrial fibrillation was 11.3% (n = 1105), ranging from 7.2% in isolated CABG to 30% in valve surgery. In-hospital mortality, stroke, and renal failure were more common in atrial fibrillationpatients (all p < 0.0001), although the association between atrial fibrillationand mortality was not statistically significant in multivariate logisticregression. Longitudinal analyses showed that preoperative atrial fibrillationwas associated with decreased event-free survival (adjusted hazard ratio 1.55,95% confidence interval 1.42-1.70, p < 0.0001). CONCLUSIONS: Preoperative atrial fibrillation is associated with increased late mortality and recurrentcardiovascular events post-cardiac surgery. Effective management strategies foratrial fibrillation need to be explored and may provide an opportunity to improvethe long-term outcomes of cardiac surgical patients.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|Combined EEG/fMRI recording has been used to localize the generators of EEGevents and to identify subject state in cognitive studies and is of increasinginterest. However, the large EEG artifacts induced during fMRI have precludedsimultaneous EEG and fMRI recording, restricting study design. Removing thisartifact is difficult, as it normally exceeds EEG significantly and containscomponents in the EEG frequency range. We have developed a recording system andan artifact reduction method that reduce this artifact effectively. The recordingsystem has large dynamic range to capture both low-amplitude EEG and largeimaging artifact without distortion (resolution 2 microV, range 33.3 mV), 5-kHzsampling, and low-pass filtering prior to the main gain stage. Imaging artifactis reduced by subtracting an averaged artifact waveform, followed by adaptivenoise cancellation to reduce any residual artifact. This method was validated in recordings from five subjects using periodic and continuous fMRI sequences.Spectral analysis revealed differences of only 10 to 18% between EEG recorded in the scanner without fMRI and the corrected EEG. Ninety-nine percent of spikewaves (median 74 microV) added to the recordings were identified in the correctedEEG compared to 12% in the uncorrected EEG. The median noise after artifactreduction was 8 microV. All these measures indicate that most of the artifact wasremoved, with minimal EEG distortion. Using this recording system and artifactreduction method, we have demonstrated that simultaneous EEG/fMRI studies are forthe first time possible, extending the scope of EEG/fMRI studies considerably.                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|Kohlschutter syndrome is a rare neurodegenerative disorder presenting withintractable seizures, developmental regression and characteristic hypoplasticdental enamel indicative of amelogenesis imperfecta. We report a new family with two affected siblings.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|Statistical analysis of neuroimages is commonly approached with intergroupcomparisons made by repeated application of univariate or multivariate testsperformed on the set of the regions of interest sampled in the acquired images.The use of such large numbers of tests requires application of techniques forcorrection for multiple comparisons. Standard multiple comparison adjustments(such as the Bonferroni) may be overly conservative when data are correlatedand/or not normally distributed. Resampling-based step-down procedures thatsuccessfully account for unknown correlation structures in the data have recentlybeen introduced. We combined resampling step-down procedures with the MinimumVariance Adaptive method, which allows selection of an optimal test statisticfrom a predefined class of statistics for the data under analysis. As shown insimulation studies and analysis of autoradiographic data, the combined technique exhibits a significant increase in statistical power, even for small sample sizes(n = 8, 9, 10).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|The synthetic DOX-LNA conjugate was characterized by proton nuclear magneticresonance and mass spectrometry. In addition, the purity of the conjugate wasanalyzed by reverse-phase high-performance liquid chromatography. The cellularuptake, intracellular distribution, and cytotoxicity of DOX-LNA were assessed by flow cytometry, fluorescence microscopy, liquid chromatography/electrosprayionization tandem mass spectrometry, and the tetrazolium dye assay using the invitro cell models. The DOX-LNA conjugate showed substantially highertumor-specific cytotoxicity compared with DOX.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|Our objective was to compare three different methods of blood pressuremeasurement through the results of a controlled study aimed at comparing theantihypertensive effects of trandolapril and losartan. Two hundred andtwenty-nine hypertensive patients were randomized in a double-blind parallelgroup study. After a 3-week placebo period, they received either 2 mgtrandolapril or 50 mg losartan once daily for 6 weeks. At the end of both placeboand active treatment periods, three methods of blood pressure measurement wereused: a) office blood pressure (three consecutive measurements); b) home selfblood pressure measurements (SBPM), consisting of three consecutive measurements performed at home in the morning and in the evening for 7 consecutive days; andc) ambulatory blood pressure measurements (ABPM), 24-h BP recordings with threemeasurements per hour. Of the 229 patients, 199 (87%) performed at least 12 validSBPM measurements during both placebo and treatment periods, whereas only 160(70%) performed good quality 24-h ABPM recordings during both periods (P <.0001). One hundred-forty patients performed the three methods of measurementwell. At baseline and with treatment, agreement between office measurements andABPM or SBPM was weak. Conversely, there was a good agreement between ABPM andSBPM. The mean difference (SBP/DBP) between ABPM and SBPM was 4.6 +/- 10.4/3.5+/- 7.1 at baseline and 3.5 +/- 10.0/4.0 +/- 7.0 at the end of the treatmentperiod. The correlation between SBPM and ABPM expressed by the r coefficient and the P values were the following: at baseline 0.79/0.70 (< 0.001/< .0001), withactive treatment 0.74/0.69 (0.0001/.0001). Hourly and 24-h reproducibility ofblood pressure response was quantified by the standard deviation of BP response. Compared with office blood pressure, both global and hourly SBPM responsesexhibited a lower standard deviation. 
Hourly reproducibility of SBPM response(10.8 mm Hg/6.9 mm Hg) was lower than hourly reproducibility of ABPM response(15.6 mm Hg/11.9 mm Hg). In conclusion, SBPM was easier to perform than ABPM.There was a good agreement between these two methods whereas concordance between SBPM or ABPM and office measurements was weak. As hourly reproducibility of SBPM response is better than reproducibility of both hourly ABPM and office BPresponse, SBPM seems to be the most appropriate method for evaluating residualantihypertensive effect.|
+|We conducted a phase II study to assess the efficacy and tolerability ofirinotecan and cisplatin as salvage chemotherapy in patients with advancedgastric adenocarcinoma, progressing after both 5-fluorouracil (5-FU)- andtaxane-containing regimen. Patients with measurable metastatic gastric cancer,progressive after previous chemotherapy that consisted either of a 5-FU-basedregimen followed by second-line chemotherapy containing taxanes or a 5-FU andtaxane combination were treated with irinotecan and cisplatin. Irinotecan 70mg/m(2) was administered on day 1 and day 15; cisplatin 70 mg/m(2) wasadministered on day 1. Treatment was repeated every 4 weeks. For 28 patientsregistered, a total of 94 chemotherapy cycles were administered. The patients'median age was 51 years and 27 (96%) had an ECOG performance status of 1 orbelow. In an intent-to-treat analysis, seven patients (25%) achieved a partialresponse, which maintained for 6.3 months (95% confidence interval 6.2-6.4months). The median progression-free and overall survival were 3.5 and 5.6months, respectively. Major toxic effects included nausea, diarrhea andneurotoxicity. Although there was one possible treatment-related death, toxicity profiles were generally predictable and manageable. We conclude that irinotecanand cisplatin is an active combination for patients with metastatic gastriccancer in whom previous chemotherapy with 5-FU and taxanes has failed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+|"""Monomeric sarcosine oxidase (MSOX) is a flavoenzyme that catalyzes the oxidative demethylation of sarcosine (N-methylglycine) to yield glycine                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                 |
++------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+''')
+st.code("""
+import re
+from pyspark.sql import functions as F
+from pyspark.sql.types import StringType
+from pyspark.sql.functions import udf
+from IPython.display import HTML, display
+# Fit and transform the dataframe to get the predictions
+result = yake_pipeline.fit(df).transform(df)
+result = result.withColumn('unique_keywords', F.array_distinct("keywords.result"))
+def highlight(text, keywords):
+    for k in keywords:
+        # Escape regex metacharacters in the keyword so it is matched literally
+        k = re.escape(k)
+        # Use <mark> tag to highlight
+        text = re.sub(r'(\\b%s\\b)' % k, r'<mark>\\1</mark>', text, flags=re.IGNORECASE)
+    return text
+highlight_udf = udf(highlight, StringType())
+result = result.withColumn("highlighted_keywords", highlight_udf('text', 'unique_keywords'))
+for r in result.select("highlighted_keywords").limit(20).collect():
+    # Display the HTML content
+    display(HTML(r.highlighted_keywords))
+    print("\\n\\n")
+""", language='python')
+with st.expander("View Output"):
+    st.markdown('''
+        <div id="output-area"><span id="output-header"> </span><div id="output-body"><div class="display_data output-id-1"><div class="output_subarea output_html rendered_html">The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated <mark>inwardly rectifying</mark> potassium (GIRK) <mark>channel family</mark>. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a <mark>candidate gene</mark> <mark>forType II</mark> <mark>diabetes mellitus</mark> in the <mark>Pima Indian</mark> population. The <mark>gene spansapproximately</mark> 7.6 kb and <mark>contains one</mark> noncoding and <mark>two coding</mark> <mark>exons separated</mark> byapproximately 2.2 and approximately 2.6 <mark>kb introns</mark>, respectively. We identified14 <mark>single nucleotide</mark> polymorphisms (SNPs), <mark>including one</mark> that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various <mark>humantissues including</mark> pancreas, and <mark>two major</mark> insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the <mark>locus in Type</mark> <mark>II diabetes</mark>.</div></div><div class="stream output-id-2"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-3"><div class="output_subarea output_html rendered_html">BACKGROUND: At present, it is one of the most <mark>important issues</mark> for the treatment of <mark>breast cancer</mark> to develop the <mark>standard therapy</mark> for <mark>patients previously</mark> <mark>treated with anthracyclines</mark> and taxanes. With the objective of determining the <mark>usefulnessof vinorelbine</mark> monotherapy in <mark>patients with advanced</mark> or <mark>recurrent breast</mark> cancerafter <mark>standard therapy</mark>, we evaluated the efficacy and safety of <mark>vinorelbine inpatients</mark> <mark>previously treated</mark> with <mark>anthracyclines and taxanes</mark>. METHODS: <mark>Vinorelbinewas administered</mark> at a <mark>dose level</mark> of 25 mg/m(2) intravenously on days 1 and 8 of a3 week cycle. Patients were given three or more cycles in the absence of tumorprogression. A maximum of <mark>nine cycles</mark> were administered. RESULTS: The responserate in 50 <mark>evaluable patients</mark> was 20.0% (10 out of 50; 95% confidence interval,10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + <mark>one MR</mark> + 18 NCs out of 50]. TheKaplan-Meier estimate (50% point) of time to progression (TTP) was 115.0 days.The response rate in the visceral organs was 17.3% (<mark>nine PRs</mark> out of 52). Themajor toxicity was myelosuppression, which was reversible and did not requirediscontinuation of treatment. CONCLUSION: The results of this study show thatvinorelbine monotherapy is useful in <mark>patients with advanced</mark> or recurrent <mark>breastcancer previously</mark> exposed to both <mark>anthracyclines and taxanes</mark>.</div></div><div class="stream output-id-4"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-5"><div class="output_subarea output_html rendered_html">OBJECTIVE: To investigate the relationship between <mark>preoperative atrialfibrillation</mark> and early and <mark>late clinical</mark> <mark>outcomes following</mark> cardiac surgery.METHODS: A <mark>retrospective cohort</mark> including all <mark>consecutive coronary</mark> <mark>artery bypass</mark> graft and/or <mark>valve surgery</mark> patients between 1995 and 2005 was identified (n =9796). No patient had a <mark>concomitant surgical</mark> <mark>AF ablation</mark>. The association <mark>betweenpreoperative atrial</mark> fibrillation and in-hospital outcomes was examined. We <mark>alsodetermined late</mark> death and cardiovascular-related re-hospitalization by linking toadministrative health databases. Median follow-up was 2.9 years (maximum 11years). RESULTS: The prevalence of preoperative <mark>atrial fibrillation</mark> was 11.3% (n = 1105), ranging from 7.2% in <mark>isolated CABG</mark> to 30% in <mark>valve surgery</mark>. In-hospital mortality, stroke, and renal failure were more common in <mark>atrial fibrillationpatients</mark> (all p &lt; 0.0001), although the association between <mark>atrial fibrillationand</mark> mortality was not statistically significant in multivariate logisticregression. Longitudinal analyses showed that <mark>preoperative atrial</mark> fibrillationwas associated with decreased event-free survival (adjusted hazard ratio 1.55,95% confidence interval 1.42-1.70, p &lt; 0.0001). CONCLUSIONS: Preoperative <mark>atrial fibrillation</mark> is associated with <mark>increased late</mark> mortality and recurrentcardiovascular events post-cardiac surgery. 
Effective management strategies foratrial fibrillation need to be explored and may provide an opportunity to improvethe long-term outcomes of <mark>cardiac surgical</mark> patients.</div></div><div class="stream output-id-6"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-7"><div class="output_subarea output_html rendered_html">Combined EEG/<mark>fMRI recording</mark> has been used to localize the generators of EEGevents and to identify subject state in cognitive studies and is of increasinginterest. However, the <mark>large EEG</mark> artifacts induced during fMRI have <mark>precludedsimultaneous EEG</mark> and <mark>fMRI recording</mark>, restricting study design. Removing thisartifact is difficult, as it normally <mark>exceeds EEG</mark> significantly and containscomponents in the <mark>EEG frequency</mark> range. We have developed a <mark>recording system</mark> <mark>andan artifact</mark> reduction method that reduce this <mark>artifact effectively</mark>. The recordingsystem has large dynamic range to capture both low-amplitude EEG and <mark>largeimaging artifact</mark> without distortion (resolution 2 microV, range 33.3 mV), 5-kHzsampling, and low-pass filtering prior to the main gain stage. Imaging artifactis reduced by subtracting an <mark>averaged artifact</mark> waveform, followed by adaptivenoise cancellation to reduce any <mark>residual artifact</mark>. This method was validated in recordings from five subjects using periodic and continuous fMRI sequences.Spectral analysis revealed differences of only 10 to 18% between <mark>EEG recorded</mark> in the scanner <mark>without fMRI</mark> and the <mark>corrected EEG</mark>. Ninety-nine percent of spikewaves (median 74 microV) added to the recordings were identified in the correctedEEG compared to 12% in the <mark>uncorrected EEG</mark>. The median noise after artifactreduction was 8 microV. All these measures indicate that most of the <mark>artifact wasremoved</mark>, with <mark>minimal EEG</mark> distortion. 
Using this <mark>recording system</mark> and artifactreduction method, we have demonstrated that simultaneous EEG/fMRI studies are forthe first time possible, extending the scope of EEG/fMRI studies considerably.</div></div><div class="stream output-id-8"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-9"><div class="output_subarea output_html rendered_html"><mark>Kohlschutter syndrome</mark> is a <mark>rare neurodegenerative</mark> <mark>disorder presenting</mark> <mark>withintractable seizures</mark>, <mark>developmental regression</mark> and <mark>characteristic hypoplasticdental</mark> <mark>enamel indicative</mark> of <mark>amelogenesis imperfecta</mark>. We report a <mark>new family</mark> with <mark>two affected</mark> siblings.</div></div><div class="stream output-id-10"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-11"><div class="output_subarea output_html rendered_html"><mark>Statistical analysis</mark> of neuroimages is <mark>commonly approached</mark> with <mark>intergroupcomparisons made</mark> by <mark>repeated application</mark> of univariate or <mark>multivariate testsperformed</mark> on the set of the regions of <mark>interest sampled</mark> in the acquired images.The use of such <mark>large numbers</mark> of <mark>tests requires</mark> application of <mark>techniques forcorrection</mark> for <mark>multiple comparisons</mark>. <mark>Standard multiple</mark> comparison adjustments(such as the Bonferroni) may be <mark>overly conservative</mark> when data are correlatedand/or not <mark>normally distributed</mark>. Resampling-based step-down <mark>procedures thatsuccessfully</mark> account for <mark>unknown correlation</mark> structures in the data have <mark>recentlybeen introduced</mark>. We <mark>combined resampling</mark> step-down procedures with the <mark>MinimumVariance Adaptive</mark> method, which allows selection of an optimal test statisticfrom a predefined class of statistics for the <mark>data under analysis</mark>. As shown insimulation studies and analysis of <mark>autoradiographic data</mark>, the <mark>combined technique</mark> exhibits a significant increase in <mark>statistical power</mark>, even for small sample sizes(n = 8, 9, 10).</div></div><div class="stream output-id-12"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-13"><div class="output_subarea output_html rendered_html">The synthetic DOX-LNA <mark>conjugate was characterized</mark> by <mark>proton nuclear</mark> magneticresonance and <mark>mass spectrometry</mark>. In addition, the purity of the <mark>conjugate wasanalyzed</mark> by reverse-phase high-performance <mark>liquid chromatography</mark>. The cellularuptake, <mark>intracellular distribution</mark>, and cytotoxicity of DOX-LNA were assessed by <mark>flow cytometry</mark>, <mark>fluorescence microscopy</mark>, <mark>liquid chromatography</mark>/electrosprayionization tandem <mark>mass spectrometry</mark>, and the <mark>tetrazolium dye</mark> <mark>assay using</mark> the <mark>invitro cell</mark> models. The DOX-LNA <mark>conjugate showed</mark> substantially highertumor-specific <mark>cytotoxicity compared</mark> with DOX.</div></div><div class="stream output-id-14"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-15"><div class="output_subarea output_html rendered_html">Our objective was to <mark>compare three</mark> different <mark>methods of blood</mark> pressuremeasurement through the results of a controlled study aimed at comparing theantihypertensive effects of trandolapril and losartan. Two hundred andtwenty-nine hypertensive patients were randomized in a double-blind parallelgroup study. After a 3-week placebo period, they received either 2 mgtrandolapril or 50 mg losartan once daily for 6 weeks. At the end of both placeboand active <mark>treatment periods</mark>, <mark>three methods</mark> of <mark>blood pressure</mark> measurement wereused: a) office <mark>blood pressure</mark> (<mark>three consecutive</mark> measurements); b) home selfblood <mark>pressure measurements</mark> (SBPM), consisting of <mark>three consecutive</mark> <mark>measurements performed</mark> at home in the morning and in the evening for 7 consecutive days; andc) ambulatory <mark>blood pressure</mark> measurements (ABPM), 24-h <mark>BP recordings</mark> with threemeasurements per hour. Of the 229 patients, 199 (87%) performed at least 12 validSBPM measurements during both placebo and <mark>treatment periods</mark>, whereas only 160(70%) performed good quality 24-h <mark>ABPM recordings</mark> during both periods (P &lt;.0001). One hundred-forty patients performed the <mark>three methods</mark> of measurementwell. At baseline and with treatment, agreement between <mark>office measurements</mark> andABPM or SBPM was weak. Conversely, there was a <mark>good agreement</mark> between <mark>ABPM andSBPM</mark>. The mean difference (SBP/DBP) between <mark>ABPM and SBPM</mark> was 4.6 +/- 10.4/3.5+/- 7.1 at baseline and 3.5 +/- 10.0/4.0 +/- 7.0 at the end of the treatmentperiod. 
The correlation between SBPM and <mark>ABPM expressed</mark> by the r coefficient and the P values were the following: at baseline 0.79/0.70 (&lt; 0.001/&lt; .0001), withactive treatment 0.74/0.69 (0.0001/.0001). Hourly and 24-h reproducibility ofblood pressure response was quantified by the standard deviation of <mark>BP response</mark>. Compared with office <mark>blood pressure</mark>, both global and <mark>hourly SBPM</mark> responsesexhibited a lower standard deviation. <mark>Hourly reproducibility</mark> of <mark>SBPM response</mark>(10.8 <mark>mm Hg</mark>/6.9 <mark>mm Hg</mark>) was lower than <mark>hourly reproducibility</mark> of ABPM response(15.6 <mark>mm Hg</mark>/11.9 <mark>mm Hg</mark>). In conclusion, SBPM was easier to perform than ABPM.There was a <mark>good agreement</mark> between these two methods whereas concordance between <mark>SBPM or ABPM</mark> and <mark>office measurements</mark> was weak. As <mark>hourly reproducibility</mark> of <mark>SBPM response</mark> is better than reproducibility of both <mark>hourly ABPM</mark> and office BPresponse, <mark>SBPM seems</mark> to be the most appropriate method for evaluating residualantihypertensive effect.</div></div><div class="stream output-id-16"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-17"><div class="output_subarea output_html rendered_html">We conducted a <mark>phase II</mark> study to assess the efficacy and <mark>tolerability ofirinotecan</mark> and cisplatin as <mark>salvage chemotherapy</mark> in patients with <mark>advancedgastric adenocarcinoma</mark>, progressing after both 5-fluorouracil (5-FU)- andtaxane-containing regimen. Patients with <mark>measurable metastatic</mark> gastric cancer,progressive after <mark>previous chemotherapy</mark> that <mark>consisted either</mark> of a 5-FU-basedregimen followed by second-line <mark>chemotherapy containing</mark> taxanes or a 5-FU <mark>andtaxane combination</mark> were treated with <mark>irinotecan and cisplatin</mark>. Irinotecan 70mg/m(2) was <mark>administered on day</mark> 1 and day 15; cisplatin 70 mg/m(2) wasadministered on day 1. Treatment was <mark>repeated every</mark> 4 weeks. For 28 patientsregistered, a total of 94 <mark>chemotherapy cycles</mark> were administered. The patients'median age was 51 years and 27 (96%) had an <mark>ECOG performance</mark> status of 1 orbelow. In an intent-to-treat analysis, <mark>seven patients</mark> (25%) achieved a partialresponse, which maintained for 6.3 months (95% confidence interval 6.2-6.4months). The median progression-free and overall survival were 3.5 and 5.6months, respectively. Major toxic effects included nausea, diarrhea andneurotoxicity. Although there was one possible treatment-related death, toxicity profiles were generally predictable and manageable. We conclude that <mark>irinotecanand cisplatin</mark> is an <mark>active combination</mark> for patients with <mark>metastatic gastriccancer</mark> in whom <mark>previous chemotherapy</mark> with 5-FU and taxanes has failed.</div></div><div class="stream output-id-18"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-19"><div class="output_subarea output_html rendered_html">"""Monomeric <mark>sarcosine oxidase</mark> (MSOX) is a <mark>flavoenzyme that catalyzes</mark> the <mark>oxidative demethylation</mark> of sarcosine (N-methylglycine) to <mark>yield glycine</mark></div></div><div class="stream output-id-20"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-21"><div class="output_subarea output_html rendered_html">We presented the <mark>tachinid fly</mark> <mark>Exorista japonica</mark> with <mark>moving host</mark> models: afreeze-dried larva of the <mark>common armyworm</mark> <mark>Mythimna separata</mark>, a <mark>black rubber</mark> tube,and a <mark>black rubber</mark> sheet, to examine the effects of size, curvature, and <mark>velocityon visual</mark> recognition of the host. The <mark>host models</mark> were moved around the fly on ametal arm driven by motor. The size of the larva, the velocity of movement, <mark>andthe length</mark> and diameter of the <mark>rubber tube</mark> were varied. During the presentationof the <mark>host model</mark>, fixation, approach, and examination behaviours of the flieswere recorded. The <mark>fly fixated</mark> on, approached, and examined the <mark>black rubber</mark> tubeas well as the freeze-dried larva. Furthermore, the <mark>fly detected</mark> the <mark>black rubbertube</mark> at a greater distance than the larva. The <mark>rubber tube</mark> elicited higher rates of approach and <mark>examination responses</mark> than the <mark>rubber sheet</mark>, suggesting thatcurvature affects the responses of the flies. The length, diameter, and velocity of <mark>host models</mark> had little effect on response rates of the flies. During hostpursuit, the <mark>fly appeared</mark> to walk towards the ends of the tube. These resultssuggest that the flies respond to the leading or trailing edges of a movingobject and ignore the <mark>length and diameter</mark> of the object.</div></div><div class="stream output-id-22"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-23"><div class="output_subarea output_html rendered_html">The <mark>literature dealing</mark> with the <mark>water conducting</mark> properties of <mark>sapwood xylem</mark> intrees is inconsistent in terminology, symbols and units. This has <mark>resulted fromconfusion</mark> in the use of either an analogy to Ohm's law or Darcy's <mark>law as thebasis</mark> for nomenclature. Ohm's <mark>law describes</mark> <mark>movement of electricity</mark> through aconductor, whereas Darcy's <mark>law describes</mark> movement of a fluid (liquid or gas)through a <mark>porous medium</mark>. However, it is generally not realized that, in <mark>theirfull notation</mark>, these laws are <mark>mathematically equivalent</mark>. Despite this, plantphysiologists have failed to agree on a convention for nomenclature. As a result,the study of <mark>water movement</mark> through <mark>sapwood xylem</mark> is confusing, especially forscientists entering the field. To improve clarity, we suggest the adoption of <mark>asingle nomenclature</mark> that can be used by all plant physiologists when <mark>describingwater movement</mark> in xylem. Darcy's law is an explicit hydraulic relationship andthe basis for established theories that describe three-dimensional saturated and unsaturated flow in <mark>porous media</mark>. We suggest, therefore, that Darcy's law is the more appropriate theoretical framework on which to <mark>base nomenclature</mark> describingsapwood hydraulics. Our <mark>proposed nomenclature</mark> is summarized in a <mark>table thatdescribes</mark> conventional terms, with their formulae, dimensions, units and symbols;the <mark>table also</mark> lists the many synonyms found in <mark>recent literature</mark> that describethe same concepts. 
Adoption of this proposal will require some changes in the <mark>useof terminology</mark>, but a common <mark>rigorous nomenclature</mark> is needed for efficient andclear communication among scientists.</div></div><div class="stream output-id-24"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-25"><div class="output_subarea output_html rendered_html">A <mark>novel approach</mark> to synthesize chitosan-O-isopropyl-5'-O-d4T <mark>monophosphateconjugate was developed</mark>. Chitosan-d4T <mark>monophosphate prodrug</mark> with <mark>aphosphoramidate linkage</mark> was <mark>efficiently synthesized</mark> through Atherton-Toddreaction. In <mark>vitro drug</mark> <mark>release studies</mark> in pH 1.1 and 7.4 indicated thatchitosan-O-isopropyl-5'-O-d4T <mark>monophosphate conjugate</mark> <mark>prefers to release</mark> the d4T 5'-(O-isopropyl)monophosphate than free d4T for a <mark>prolonged period</mark>. The resultssuggested that chitosan-O-isopropyl-5'-O-d4T <mark>monophosphate conjugate</mark> <mark>may be used</mark> as a <mark>sustained polymeric</mark> prodrug for <mark>improving therapy</mark> efficacy and <mark>reducing sideeffects</mark> in <mark>antiretroviral treatment</mark>.</div></div><div class="stream output-id-26"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-27"><div class="output_subarea output_html rendered_html">An HPLC-ESI-MS-MS method has been developed for the <mark>quantitative determination</mark> <mark>offour diterpenoids</mark>, dihydrotanshinone I, cryptotanshinone, tanshinone I, <mark>andtanshinone IIA</mark> in <mark>Radix Salviae</mark> Miltiorrhizae (RSM, the root of <mark>Salviamiltiorrhiza BGE</mark>.). The diterpenoids were <mark>chromatographically separated</mark> on a C18 <mark>HPLC column</mark>, and the quantification of these <mark>diterpenoids was based</mark> on thefragments of [M+H]+ under collision-activated conditions and in <mark>Selected ReactionMonitoring</mark> (SRM) mode. The <mark>quantitative method</mark> was validated, and the <mark>meanrecovery rates</mark> from <mark>fortified samples</mark> (n=5) of dihydrotanshinone I,cryptotanshinone, tanshinone I, and <mark>tanshinone IIA</mark> were 95.0%, 97.2%, 93.1%, and 95.9% with <mark>variation coefficient</mark> of 6.0%, 4.3%, 3.7%, and 4.2%, respectively. <mark>Theestablished method</mark> was <mark>successfully applied</mark> to the <mark>quality assessment</mark> of sevenbatches of <mark>RSM samples</mark> collected from <mark>different regions</mark> of China.</div></div><div class="stream output-id-28"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-29"><div class="output_subarea output_html rendered_html">The localizing and <mark>lateralizing values</mark> of eye and <mark>head ictal</mark> <mark>deviations duringfrontal</mark> <mark>lobe seizures</mark> are <mark>still matters</mark> of debate. In particular, no <mark>specificdata regarding</mark> the origin of <mark>ipsilateral head</mark> turning in <mark>frontal lobe</mark> <mark>seizuresare available</mark>. We report a patient with frontal <mark>lobe seizures</mark> <mark>associated withreproducible</mark>, early, <mark>ipsilateral head</mark> deviation, where imaging andvideo-stereo-electroencephalography data, as well as surgical outcome,demonstrated the fronto-polar and orbito-frontal origin of the epilepticdischarge. We conclude that early <mark>ipsilateral head</mark> deviation, in the context <mark>offrontal lobe</mark> epilepsy, raises the possibility of fronto-polar or orbito-frontalseizure onset.[Published with video sequences].</div></div><div class="stream output-id-30"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-31"><div class="output_subarea output_html rendered_html">OBJECTIVE: To evaluate the effectiveness and acceptability of <mark>expectantmanagement of induced</mark> and <mark>spontaneous first</mark> <mark>trimester incomplete</mark> abortion.METHODS: A <mark>prospective observational</mark> trial, <mark>conducted between June</mark> 2006 andNovember 2007, of 2 groups of <mark>patients diagnosed</mark> with an <mark>incomplete abortion</mark>: 66 patients who had <mark>received misoprostol</mark> for an <mark>induced abortion</mark> (group 1) and 30patients who had had a <mark>spontaneous abortion</mark> (group 2). <mark>Transvaginal ultrasoundwas</mark> <mark>performed weekly</mark>. The <mark>success rate</mark> (<mark>complete abortion</mark> without surgery), time to resolution, duration of bleeding and pelvic pain, rate of infection, number ofunscheduled hospital visits, and level of satisfaction with <mark>expectant management</mark> were recorded. RESULTS: The incidence of <mark>complete abortion</mark> was 86.4% and 82.1% ingroups 1 and 2 respectively at day 14 after diagnosis, and 100% in both <mark>groups atday</mark> 30 (<mark>two group</mark> 2 <mark>patients underwent</mark> curettage and were excluded from theanalysis). Both <mark>groups reported</mark> 100% satisfaction with <mark>expectant management</mark>,although over 90% of the women reported feeling anxious. CONCLUSION: Expectantmanagement for <mark>incomplete abortion</mark> in the <mark>first trimester</mark> after use ofmisoprostol or after <mark>spontaneous abortion</mark> may be practical and feasible, althoughit may increase anxiety associated with the <mark>impending abortion</mark>.</div></div><div class="stream output-id-32"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-33"><div class="output_subarea output_html rendered_html">For the construction of <mark>new combinatorial</mark> libraries, a <mark>lead compound</mark> was created by replacing the <mark>core structure</mark> of a <mark>hit compound</mark> discovered by <mark>screening forcytotoxic</mark> agents against a <mark>tumorigenic cell</mark> line. The <mark>newly designed</mark> <mark>compoundmaintained biological</mark> activity and <mark>allowed alternative</mark> <mark>library construction</mark> <mark>forantitumor drugs</mark>.</div></div><div class="stream output-id-34"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-35"><div class="output_subarea output_html rendered_html">We report the results of a screen for <mark>genetic association</mark> with <mark>urinary arsenicmetabolite</mark> levels in <mark>three arsenic</mark> <mark>metabolism candidate</mark> genes, PNP, GSTO, andCYT19, in 135 arsenic-exposed subjects from the <mark>Yaqui Valley</mark> in Sonora, Mexico,who were exposed to <mark>drinking water</mark> <mark>concentrations ranging</mark> from 5.5 to 43.3 ppb.We chose 23 <mark>polymorphic sites</mark> to test in the arsenic-exposed population. <mark>Initial phenotypes</mark> evaluated included the ratio of <mark>urinary inorganic</mark> arsenic(III) toinorganic arsenic(V) and the <mark>ratio of urinary</mark> dimethylarsenic(V) tomonomethylarsenic(V) (D:M). In the <mark>initial association</mark> screening, <mark>threepolymorphic sites</mark> in the CYT19 gene were significantly associated with D:M ratiosin the <mark>total population</mark>. Subsequent analysis of this <mark>association revealed</mark> <mark>thatthe association</mark> signal for the <mark>entire population</mark> was actually caused by anextremely <mark>strong association</mark> in only the children (7-11 years of age) betweenCYT19 genotype and D:M levels. With children removed from the analysis, nosignificant <mark>genetic association</mark> was observed in adults (18-79 years). Theexistence of a strong, developmentally regulated <mark>genetic association</mark> betweenCYT19 and <mark>arsenic metabolism</mark> carries import for both <mark>arsenic pharmacogenetics</mark> andarsenic toxicology, as well as for public health and governmental regulatoryofficials.</div></div><div class="stream output-id-36"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-37"><div class="output_subarea output_html rendered_html"><mark>Intraparenchymal pericatheter</mark> cyst is <mark>rarely reported</mark>. Obstruction in <mark>theventriculoperitoneal shunt</mark> leads to recurrence of hydrocephalus, signs of <mark>raised intracranial</mark> pressure and <mark>possibly secondary</mark> complications. Blockage of <mark>thedistal catheter</mark> can result, unusually, in <mark>cerebrospinal fluid</mark> oedema and/orintraparenchymal <mark>cyst around</mark> the <mark>ventricular catheter</mark> which <mark>may produce</mark> <mark>focalneurological deficit</mark>. We report two cases of <mark>distal catheter</mark> <mark>obstruction withformation</mark> of cysts causing local mass effect and <mark>neurological deficit</mark>. Bothpatients had their <mark>shunt system</mark> replaced, which led to resolution of the <mark>cyst andclinical</mark> improvement. One patient had endoscopic exploration of the <mark>cyst whichconfirmed</mark> the diagnosis made on <mark>imaging studies</mark>. Magnetic resonance imaging wasmore helpful than computed tomography in differentiating between <mark>oedema andcollection</mark> of <mark>cystic fluid</mark>. Early recognition and treatment of <mark>pericatheter cyst</mark> in the presence of <mark>distal shunt</mark> obstruction can lead to complete resolution ofsymptoms and signs.</div></div><div class="stream output-id-38"><div class="output_subarea output_text"><pre>
+        </pre></div></div><div class="display_data output-id-39"><div class="output_subarea output_html rendered_html">It is <mark>known that patients</mark> with Klinefelter's <mark>syndrome are inclined</mark> to <mark>developconcomitant malignant</mark> tumours, as well as <mark>extragonadal germ</mark> <mark>cell tumours</mark>. Theassociation of a <mark>primary spinal</mark> germinoma in a patient with Klinefelter'ssyndrome is reported for the <mark>first time</mark>, and the coincidence of <mark>elevatedgonadotropin levels</mark> and <mark>oncogenesis is discussed</mark>.</div></div><div class="stream output-id-40"><div class="output_subarea output_text"><pre>
+        </pre></div></div></div><span id="output-footer"></span></div>
+    ''', unsafe_allow_html=True)
+# Conclusion
+st.markdown('<div class="sub-title">Conclusion</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <p>This demo showed how to extract keywords from text with the YakeKeywordExtraction annotator in Spark NLP. It covered setting up the environment, building a pipeline, running the keyword extraction, and highlighting the extracted keywords in the original text.</p>
+</div>
+""", unsafe_allow_html=True)
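The highlighting step mentioned in the Conclusion can be sketched roughly as follows. This is a minimal illustration, not the code used by the notebook or by Spark NLP: the helper name `highlight_keywords` is hypothetical, and it assumes the extracted keywords have already been collected into a plain Python list of strings.

```python
import html
import re


def highlight_keywords(text, keywords):
    """Return HTML in which every keyword occurrence is wrapped in <mark>.

    Hypothetical helper: `keywords` is assumed to be a list of strings that
    was already pulled out of the keyword extraction results.
    """
    escaped_text = html.escape(text)
    # Escape keywords the same way as the text so that they still match.
    unique = {html.escape(k) for k in keywords if k}
    if not unique:
        return escaped_text
    # Longest phrases first, so a phrase wins over any shorter keyword it contains.
    ordered = sorted(unique, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Lookarounds keep matches from starting or ending in the middle of a word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", escaped_text)
```

In a Streamlit page, the returned string could then be passed to `st.markdown(..., unsafe_allow_html=True)`, as the rest of this file does for its HTML snippets. Escaping the text before inserting the `<mark>` tags is a design choice that keeps characters such as `<` in the abstracts from being read as markup.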
+# References and Additional Information
+st.markdown('<div class="sub-title">For additional information, please check the following references.</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li>Documentation <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#yakekeywordextraction" target="_blank" rel="noopener">YakeKeywordExtraction</a></li>
+        <li>Python Docs: <a class="link" href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/keyword_extraction/yake_keyword_extraction/index.html" target="_blank" rel="noopener">YakeKeywordExtraction</a></li>
+        <li>Scala Docs: <a class="link" href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/keyword/yake/YakeKeywordExtraction.html">YakeKeywordExtraction</a></li>
+        <li>For extended examples of usage, see the <a class="link" href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/8.Keyword_Extraction_YAKE.ipynb" target="_blank" rel="noopener nofollow">Spark NLP Workshop repository</a>.</li>
+        <li>Reference Paper: <a class="link" href="https://www.sciencedirect.com/science/article/abs/pii/S0020025519308588" target="_blank" rel="noopener nofollow">YAKE! Keyword extraction from single documents using multiple local features</a></li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)
+st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
+st.markdown("""
+<div class="section">
+    <ul>
+        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
+        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
+        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
+        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
+        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
+    </ul>
+</div>
+""", unsafe_allow_html=True)