import streamlit as st
import pandas as pd

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Introduction
st.markdown('
Welcome to the Spark NLP Keyword Extraction Demo App! Keyword extraction is a technique in natural language processing (NLP) that involves automatically identifying the most important words or phrases in a document or corpus. Keywords extracted from a text can be used in a variety of ways, including:
This app demonstrates how to use Spark NLP's YakeKeywordExtraction annotator to perform keyword extraction using Python.
As the volume and complexity of information have grown, extracting keywords by hand has become impractical for individuals and organizations. The need to process text promptly and accurately has led to the emergence of automatic keyword extraction tools, and Python NLP libraries such as Spark NLP make these tools straightforward to apply.
Yake! is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains, or languages. Unlike other approaches, Yake! does not rely on dictionaries or thesauri, nor is it trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text.
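To make the idea of unsupervised, feature-based scoring concrete, here is a deliberately simplified sketch in plain Python. It is not the YAKE algorithm: it uses only two of the kinds of signal YAKE draws on (term frequency and position), and the function name `toy_keyword_scores`, the stop-word list, and the scoring formula are all invented for illustration. The one convention borrowed from YAKE is that a lower score marks a more relevant keyword.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def toy_keyword_scores(text, top_n=5):
    # Split into lowercase alphanumeric tokens; no dictionary or corpus is used.
    tokens = re.findall("[a-z0-9]+", text.lower())

    # Feature 1: how often each candidate word occurs.
    candidates = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    frequency = Counter(candidates)

    # Feature 2: index of the first occurrence (earlier is treated as better).
    first_position = {}
    for index, token in enumerate(tokens):
        first_position.setdefault(token, index)

    # Combine the features so that a LOWER score means a more relevant keyword.
    # The formula itself is made up for this sketch.
    scores = {
        word: (1 + first_position[word]) / (count ** 2)
        for word, count in frequency.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])[:top_n]


sample = "Keyword extraction finds keywords. Keyword extraction is unsupervised here."
for word, score in toy_keyword_scores(sample, top_n=3):
    print(word, round(score, 2))
```

Everything the function needs comes from the input text itself, which is the property that lets this family of methods work across domains and languages without training data.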
Here’s how you can implement keyword extraction using the YakeKeywordExtraction annotator in Spark NLP:
', unsafe_allow_html=True)

# Setup Instructions
st.markdown('To install Spark NLP and extract keywords in Python, simply use your favorite package manager (conda, pip, etc.). For example:
', unsafe_allow_html=True)
st.code("""
pip install spark-nlp
pip install pyspark
""", language="bash")
st.markdown("Then, import Spark NLP and start a Spark session:
", unsafe_allow_html=True)
st.code("""
import sparknlp

# Start Spark Session
spark = sparknlp.start()
""", language='python')

# Keyword Extraction Example
st.markdown('The code snippet demonstrates how to set up a pipeline in Spark NLP to perform keyword extraction on text data using the YakeKeywordExtraction annotator. The resulting DataFrame contains the keywords and their corresponding scores.
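A pipeline of this kind could be assembled roughly as follows. This is a sketch rather than the exact snippet from the demo: the helper name `build_yake_pipeline`, the intermediate column names, and the n-gram and keyword-count settings are assumptions chosen for illustration, while the `keywords` output column matches the `keywords.result` field used later in this app. The Spark imports sit inside the function so that the definition itself loads without a Spark installation.

```python
def build_yake_pipeline(min_ngrams=1, max_ngrams=3, n_keywords=20):
    # Imported lazily: these packages are only needed when the pipeline is built.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction

    # Turn the raw "text" column into Spark NLP document annotations.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")

    # Split documents into sentences, then sentences into tokens.
    sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

    # Extract keywords from the tokens; the parameter values are illustrative.
    keywords = (
        YakeKeywordExtraction()
        .setInputCols(["token"])
        .setOutputCol("keywords")
        .setMinNGrams(min_ngrams)
        .setMaxNGrams(max_ngrams)
        .setNKeywords(n_keywords)
    )

    return Pipeline(stages=[document, sentences, tokens, keywords])


# Intended use, assuming a DataFrame named df with a "text" column:
# yake_pipeline = build_yake_pipeline()
# result = yake_pipeline.fit(df).transform(df)
```

Each annotation in the resulting `keywords` column carries the keyword in `result` and its score in `metadata`, where lower scores indicate more relevant keywords.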
""", unsafe_allow_html=True)

# Highlighting Keywords in a Text
st.markdown('In addition to getting the keywords as a DataFrame, it is also possible to highlight the extracted keywords in the text.
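The highlighting step itself does not depend on Spark, so it can be sketched in plain Python first. The function name `highlight_keywords` and the choice of the `<mark>` tag are assumptions made for this illustration. The sketch matches all keywords in a single pass, longest first, so that a phrase is preferred over a shorter keyword contained in it and inserted tags are never matched again.

```python
import re


def highlight_keywords(text, keywords):
    # Drop empty entries and duplicates; longest keywords come first.
    ordered = sorted({k for k in keywords if k}, key=lambda k: (-len(k), k))
    if not ordered:
        return text

    alternatives = "|".join(re.escape(k) for k in ordered)

    # Require a non-word character (or the text edge) on both sides, so "art"
    # does not match inside "artifact". Written with character classes only.
    pattern = re.compile(
        "(?<![A-Za-z0-9_])(?:" + alternatives + ")(?![A-Za-z0-9_])",
        flags=re.IGNORECASE,
    )

    # One pass over the text, wrapping every match in the highlight tag.
    return pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", text)


print(highlight_keywords(
    "Preoperative atrial fibrillation and cardiac surgery outcomes.",
    ["atrial fibrillation", "cardiac surgery", "surgery"],
))
```

The same function body could be wrapped in a Spark UDF to run over a whole DataFrame column.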
In this example, a dataset of 7,537 texts was used: samples from PubMed, a free resource supporting the search and retrieval of biomedical and life sciences literature.
""", unsafe_allow_html=True) st.code(""" !wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv df = spark.read\\ .option("header", "true")\\ .csv("pubmed_sample_text_small.csv")\\ df.show(truncate=False) """, language='python') st.text(''' +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |text | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. | |BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes. METHODS: Vinorelbinewas administered at a dose level of 25 mg/m(2) intravenously on days 1 and 8 of a3 week cycle. Patients were given three or more cycles in the absence of tumorprogression. A maximum of nine cycles were administered. RESULTS: The responserate in 50 evaluable patients was 20.0% (10 out of 50; 95% confidence interval,10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + one MR + 18 NCs out of 50]. TheKaplan-Meier estimate (50% point) of time to progression (TTP) was 115.0 days.The response rate in the visceral organs was 17.3% (nine PRs out of 52). Themajor toxicity was myelosuppression, which was reversible and did not requirediscontinuation of treatment. CONCLUSION: The results of this study show thatvinorelbine monotherapy is useful in patients with advanced or recurrent breastcancer previously exposed to both anthracyclines and taxanes. | |OBJECTIVE: To investigate the relationship between preoperative atrialfibrillation and early and late clinical outcomes following cardiac surgery.METHODS: A retrospective cohort including all consecutive coronary artery bypass graft and/or valve surgery patients between 1995 and 2005 was identified (n =9796). No patient had a concomitant surgical AF ablation. 
The association betweenpreoperative atrial fibrillation and in-hospital outcomes was examined. We alsodetermined late death and cardiovascular-related re-hospitalization by linking toadministrative health databases. Median follow-up was 2.9 years (maximum 11years). RESULTS: The prevalence of preoperative atrial fibrillation was 11.3% (n = 1105), ranging from 7.2% in isolated CABG to 30% in valve surgery. In-hospital mortality, stroke, and renal failure were more common in atrial fibrillationpatients (all p < 0.0001), although the association between atrial fibrillationand mortality was not statistically significant in multivariate logisticregression. Longitudinal analyses showed that preoperative atrial fibrillationwas associated with decreased event-free survival (adjusted hazard ratio 1.55,95% confidence interval 1.42-1.70, p < 0.0001). CONCLUSIONS: Preoperative atrial fibrillation is associated with increased late mortality and recurrentcardiovascular events post-cardiac surgery. Effective management strategies foratrial fibrillation need to be explored and may provide an opportunity to improvethe long-term outcomes of cardiac surgical patients. | |Combined EEG/fMRI recording has been used to localize the generators of EEGevents and to identify subject state in cognitive studies and is of increasinginterest. However, the large EEG artifacts induced during fMRI have precludedsimultaneous EEG and fMRI recording, restricting study design. Removing thisartifact is difficult, as it normally exceeds EEG significantly and containscomponents in the EEG frequency range. We have developed a recording system andan artifact reduction method that reduce this artifact effectively. The recordingsystem has large dynamic range to capture both low-amplitude EEG and largeimaging artifact without distortion (resolution 2 microV, range 33.3 mV), 5-kHzsampling, and low-pass filtering prior to the main gain stage. 
Imaging artifactis reduced by subtracting an averaged artifact waveform, followed by adaptivenoise cancellation to reduce any residual artifact. This method was validated in recordings from five subjects using periodic and continuous fMRI sequences.Spectral analysis revealed differences of only 10 to 18% between EEG recorded in the scanner without fMRI and the corrected EEG. Ninety-nine percent of spikewaves (median 74 microV) added to the recordings were identified in the correctedEEG compared to 12% in the uncorrected EEG. The median noise after artifactreduction was 8 microV. All these measures indicate that most of the artifact wasremoved, with minimal EEG distortion. Using this recording system and artifactreduction method, we have demonstrated that simultaneous EEG/fMRI studies are forthe first time possible, extending the scope of EEG/fMRI studies considerably. | |Kohlschutter syndrome is a rare neurodegenerative disorder presenting withintractable seizures, developmental regression and characteristic hypoplasticdental enamel indicative of amelogenesis imperfecta. We report a new family with two affected siblings. | |Statistical analysis of neuroimages is commonly approached with intergroupcomparisons made by repeated application of univariate or multivariate testsperformed on the set of the regions of interest sampled in the acquired images.The use of such large numbers of tests requires application of techniques forcorrection for multiple comparisons. Standard multiple comparison adjustments(such as the Bonferroni) may be overly conservative when data are correlatedand/or not normally distributed. Resampling-based step-down procedures thatsuccessfully account for unknown correlation structures in the data have recentlybeen introduced. We combined resampling step-down procedures with the MinimumVariance Adaptive method, which allows selection of an optimal test statisticfrom a predefined class of statistics for the data under analysis. 
As shown insimulation studies and analysis of autoradiographic data, the combined technique exhibits a significant increase in statistical power, even for small sample sizes(n = 8, 9, 10). | |The synthetic DOX-LNA conjugate was characterized by proton nuclear magneticresonance and mass spectrometry. In addition, the purity of the conjugate wasanalyzed by reverse-phase high-performance liquid chromatography. The cellularuptake, intracellular distribution, and cytotoxicity of DOX-LNA were assessed by flow cytometry, fluorescence microscopy, liquid chromatography/electrosprayionization tandem mass spectrometry, and the tetrazolium dye assay using the invitro cell models. The DOX-LNA conjugate showed substantially highertumor-specific cytotoxicity compared with DOX. | |Our objective was to compare three different methods of blood pressuremeasurement through the results of a controlled study aimed at comparing theantihypertensive effects of trandolapril and losartan. Two hundred andtwenty-nine hypertensive patients were randomized in a double-blind parallelgroup study. After a 3-week placebo period, they received either 2 mgtrandolapril or 50 mg losartan once daily for 6 weeks. At the end of both placeboand active treatment periods, three methods of blood pressure measurement wereused: a) office blood pressure (three consecutive measurements); b) home selfblood pressure measurements (SBPM), consisting of three consecutive measurements performed at home in the morning and in the evening for 7 consecutive days; andc) ambulatory blood pressure measurements (ABPM), 24-h BP recordings with threemeasurements per hour. Of the 229 patients, 199 (87%) performed at least 12 validSBPM measurements during both placebo and treatment periods, whereas only 160(70%) performed good quality 24-h ABPM recordings during both periods (P <.0001). One hundred-forty patients performed the three methods of measurementwell. 
At baseline and with treatment, agreement between office measurements andABPM or SBPM was weak. Conversely, there was a good agreement between ABPM andSBPM. The mean difference (SBP/DBP) between ABPM and SBPM was 4.6 +/- 10.4/3.5+/- 7.1 at baseline and 3.5 +/- 10.0/4.0 +/- 7.0 at the end of the treatmentperiod. The correlation between SBPM and ABPM expressed by the r coefficient and the P values were the following: at baseline 0.79/0.70 (< 0.001/< .0001), withactive treatment 0.74/0.69 (0.0001/.0001). Hourly and 24-h reproducibility ofblood pressure response was quantified by the standard deviation of BP response. Compared with office blood pressure, both global and hourly SBPM responsesexhibited a lower standard deviation. Hourly reproducibility of SBPM response(10.8 mm Hg/6.9 mm Hg) was lower than hourly reproducibility of ABPM response(15.6 mm Hg/11.9 mm Hg). In conclusion, SBPM was easier to perform than ABPM.There was a good agreement between these two methods whereas concordance between SBPM or ABPM and office measurements was weak. As hourly reproducibility of SBPM response is better than reproducibility of both hourly ABPM and office BPresponse, SBPM seems to be the most appropriate method for evaluating residualantihypertensive effect.| |We conducted a phase II study to assess the efficacy and tolerability ofirinotecan and cisplatin as salvage chemotherapy in patients with advancedgastric adenocarcinoma, progressing after both 5-fluorouracil (5-FU)- andtaxane-containing regimen. Patients with measurable metastatic gastric cancer,progressive after previous chemotherapy that consisted either of a 5-FU-basedregimen followed by second-line chemotherapy containing taxanes or a 5-FU andtaxane combination were treated with irinotecan and cisplatin. Irinotecan 70mg/m(2) was administered on day 1 and day 15; cisplatin 70 mg/m(2) wasadministered on day 1. Treatment was repeated every 4 weeks. 
For 28 patientsregistered, a total of 94 chemotherapy cycles were administered. The patients'median age was 51 years and 27 (96%) had an ECOG performance status of 1 orbelow. In an intent-to-treat analysis, seven patients (25%) achieved a partialresponse, which maintained for 6.3 months (95% confidence interval 6.2-6.4months). The median progression-free and overall survival were 3.5 and 5.6months, respectively. Major toxic effects included nausea, diarrhea andneurotoxicity. Although there was one possible treatment-related death, toxicity profiles were generally predictable and manageable. We conclude that irinotecanand cisplatin is an active combination for patients with metastatic gastriccancer in whom previous chemotherapy with 5-FU and taxanes has failed. | |"""Monomeric sarcosine oxidase (MSOX) is a flavoenzyme that catalyzes the oxidative demethylation of sarcosine (N-methylglycine) to yield glycine | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ''') st.code(""" import re from pyspark.sql import functions as F from pyspark.sql.types import StringType from pyspark.sql.functions import udf, array_distinct from IPython.display import HTML, display # Fit and transform the dataframe to get the predictions result = yake_pipeline.fit(df).transform(df) result = result.withColumn('unique_keywords', F.array_distinct("keywords.result")) def highlight(text, keywords): for k in keywords: # Escape HTML characters in the keywords k = re.escape(k) # Use tag to highlight text = re.sub(r'(\b%s\b)' % k, r'\1', text, flags=re.IGNORECASE) return text highlight_udf = udf(highlight, StringType()) result = 
result.withColumn("highlighted_keywords", highlight_udf('text', 'unique_keywords')) for r in result.select("highlighted_keywords").limit(20).collect(): # Display the HTML content display(HTML(r.highlighted_keywords)) print("\\n\\n") """, language='python') with st.expander("View Output"): st.markdown('''In this demo, we demonstrated how to extract keywords from texts using the YakeKeywordExtraction annotator in Spark NLP. We provided step-by-step instructions on setting up the environment, creating a pipeline, and running the keyword extraction. Additionally, we explored how to highlight extracted keywords in the text.