Example Usage in Python

Here’s how you can implement keyword extraction using the YakeKeywordExtraction annotator in Spark NLP:

Setup

To install Spark NLP for Python, use your preferred package manager (conda, pip, etc.). For example, with pip, run `pip install spark-nlp pyspark`.

Then import Spark NLP and start a Spark session with `sparknlp.start()`.
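A minimal sketch of that step, assuming `spark-nlp` and `pyspark` are installed and a compatible Java runtime is available. The wrapper name `start_session` is hypothetical and used only for illustration; `sparknlp.start()` and `sparknlp.version()` are the library's own entry points.

```python
def start_session():
    """Start (or reuse) a Spark session configured for Spark NLP."""
    # Imported inside the function so that a missing installation only
    # surfaces when a session is actually requested.
    import sparknlp

    spark = sparknlp.start()
    print("Spark NLP version:", sparknlp.version())
    print("Apache Spark version:", spark.version)
    return spark
```

Calling `spark = start_session()` once at the top of a notebook or script should be enough; the returned session can then be reused for every pipeline in the rest of the walkthrough.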

Example Usage: Keyword Extraction with YakeKeywordExtraction

A Spark NLP pipeline performs keyword extraction on text data by passing tokens to the YakeKeywordExtraction annotator. The resulting DataFrame contains the keywords and their corresponding scores.
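The original snippet is not reproduced in this text, so the following is a hedged sketch of what such a pipeline could look like, not the demo's exact code. The function names `build_yake_pipeline` and `keyword_scores` are hypothetical, and the parameter values (1 to 3 n-grams, 20 keywords) are illustrative assumptions rather than library defaults. YAKE assigns lower scores to more relevant keywords, so the result is sorted in ascending order.

```python
def build_yake_pipeline(min_ngrams=1, max_ngrams=3, n_keywords=20):
    """Assemble document -> sentence -> token -> YAKE keyword stages."""
    # Imports are local so the module loads even where Spark is absent.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    keywords = (
        YakeKeywordExtraction()
        .setInputCols(["token"])
        .setOutputCol("keywords")
        .setMinNGrams(min_ngrams)   # illustrative value, not a default
        .setMaxNGrams(max_ngrams)   # illustrative value, not a default
        .setNKeywords(n_keywords)   # illustrative value, not a default
    )
    return Pipeline(stages=[document, sentences, tokens, keywords])


def keyword_scores(result):
    """Flatten the `keywords` annotations into one (keyword, score) row each."""
    from pyspark.sql import functions as F

    return (
        result.select(F.explode("keywords").alias("kw"))
        .select(
            F.col("kw.result").alias("keyword"),
            # The score is stored as a string in the annotation metadata.
            F.col("kw.metadata").getItem("score").cast("double").alias("score"),
        )
        .orderBy("score")  # lower YAKE score = more relevant keyword
    )


# Usage, given a session `spark` created with sparknlp.start():
#   data = spark.createDataFrame([["Your text here."]]).toDF("text")
#   result = build_yake_pipeline().fit(data).transform(data)
#   keyword_scores(result).show(truncate=False)
```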

Highlighting Keywords in a Text

In addition to returning the keywords as a DataFrame, it is also possible to highlight the extracted keywords in the original text.
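The idea can be tried on a plain Python string before involving Spark. The sketch below is an illustration rather than the demo's own code: the helper names `dedupe_keywords` and `highlight_keywords` and the choice of the `<mark>` tag are assumptions. It matches all keywords in a single pass, longest first, so that a phrase such as "breast cancer" is wrapped once instead of being split by the shorter keyword "cancer", and it HTML-escapes the text it emits.

```python
import html
import re


def dedupe_keywords(keywords):
    """Drop case-insensitive duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for keyword in keywords:
        key = keyword.lower()
        if key not in seen:
            seen.add(key)
            unique.append(keyword)
    return unique


def highlight_keywords(text, keywords, tag="mark"):
    """Wrap every whole-word keyword match in an HTML tag."""
    unique = dedupe_keywords(keywords)
    if not unique:
        return html.escape(text)
    # Longest keywords first so multi-word phrases win over their parts.
    ordered = sorted(unique, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(k) for k in ordered),
        flags=re.IGNORECASE,
    )
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text, then emit the tagged match.
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append("<%s>%s</%s>" % (tag, html.escape(match.group(0)), tag))
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


print(highlight_keywords("Advanced breast cancer and gastric cancer",
                         ["cancer", "breast cancer"]))
# -> Advanced <mark>breast cancer</mark> and gastric <mark>cancer</mark>
```

Because the function is pure Python, it could also be wrapped in a Spark UDF and applied to a text column together with an array column of keywords.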

In this example, a dataset of 7,537 texts was used. The texts are samples from PubMed, a free resource supporting the search and retrieval of biomedical and life sciences literature.

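```python
import re

from IPython.display import HTML, display
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def highlight(text, keywords):
    for k in keywords:
        # Wrap each keyword in a tag to highlight it. The original tag was lost
        # in extraction, so <mark> is an assumption; re.escape protects
        # against keywords that contain regex metacharacters.
        text = re.sub(r'(\b%s\b)' % re.escape(k), r'<mark>\1</mark>', text, flags=re.IGNORECASE)
    return text


highlight_udf = udf(highlight, StringType())
# `unique_keywords` is assumed to be an array column of each text's distinct keywords.
result = result.withColumn("highlighted_keywords", highlight_udf('text', 'unique_keywords'))

for r in result.select("highlighted_keywords").limit(20).collect():
    # Display the HTML content
    display(HTML(r.highlighted_keywords))
    print("\n\n")
```

Output: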

The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes.

BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes. METHODS: Vinorelbine was administered at a dose level of 25 mg/m(2) intravenously on days 1 and 8 of a 3 week cycle. Patients were given three or more cycles in the absence of tumor progression. A maximum of nine cycles were administered. RESULTS: The response rate in 50 evaluable patients was 20.0% (10 out of 50; 95% confidence interval, 10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + one MR + 18 NCs out of 50]. The Kaplan-Meier estimate (50% point) of time to progression (TTP) was 115.0 days. The response rate in the visceral organs was 17.3% (nine PRs out of 52). The major toxicity was myelosuppression, which was reversible and did not require discontinuation of treatment. CONCLUSION: The results of this study show that vinorelbine monotherapy is useful in patients with advanced or recurrent breast cancer previously exposed to both anthracyclines and taxanes.

OBJECTIVE: To investigate the relationship between preoperative atrial fibrillation and early and late clinical outcomes following cardiac surgery. METHODS: A retrospective cohort including all consecutive coronary artery bypass graft and/or valve surgery patients between 1995 and 2005 was identified (n = 9796). No patient had a concomitant surgical AF ablation. The association between preoperative atrial fibrillation and in-hospital outcomes was examined. We also determined late death and cardiovascular-related re-hospitalization by linking to administrative health databases. Median follow-up was 2.9 years (maximum 11 years). RESULTS: The prevalence of preoperative atrial fibrillation was 11.3% (n = 1105), ranging from 7.2% in isolated CABG to 30% in valve surgery. In-hospital mortality, stroke, and renal failure were more common in atrial fibrillation patients (all p < 0.0001), although the association between atrial fibrillation and mortality was not statistically significant in multivariate logistic regression. Longitudinal analyses showed that preoperative atrial fibrillation was associated with decreased event-free survival (adjusted hazard ratio 1.55, 95% confidence interval 1.42-1.70, p < 0.0001). CONCLUSIONS: Preoperative atrial fibrillation is associated with increased late mortality and recurrent cardiovascular events post-cardiac surgery. Effective management strategies for atrial fibrillation need to be explored and may provide an opportunity to improve the long-term outcomes of cardiac surgical patients.

Combined EEG/fMRI recording has been used to localize the generators of EEG events and to identify subject state in cognitive studies and is of increasing interest. However, the large EEG artifacts induced during fMRI have precluded simultaneous EEG and fMRI recording, restricting study design. Removing this artifact is difficult, as it normally exceeds EEG significantly and contains components in the EEG frequency range. We have developed a recording system and an artifact reduction method that reduce this artifact effectively. The recording system has large dynamic range to capture both low-amplitude EEG and large imaging artifact without distortion (resolution 2 microV, range 33.3 mV), 5-kHz sampling, and low-pass filtering prior to the main gain stage. Imaging artifact is reduced by subtracting an averaged artifact waveform, followed by adaptive noise cancellation to reduce any residual artifact. This method was validated in recordings from five subjects using periodic and continuous fMRI sequences. Spectral analysis revealed differences of only 10 to 18% between EEG recorded in the scanner without fMRI and the corrected EEG. Ninety-nine percent of spike waves (median 74 microV) added to the recordings were identified in the corrected EEG compared to 12% in the uncorrected EEG. The median noise after artifact reduction was 8 microV. All these measures indicate that most of the artifact was removed, with minimal EEG distortion. Using this recording system and artifact reduction method, we have demonstrated that simultaneous EEG/fMRI studies are for the first time possible, extending the scope of EEG/fMRI studies considerably.

Kohlschutter syndrome is a rare neurodegenerative disorder presenting with intractable seizures, developmental regression and characteristic hypoplastic dental enamel indicative of amelogenesis imperfecta. We report a new family with two affected siblings.

Statistical analysis of neuroimages is commonly approached with intergroup comparisons made by repeated application of univariate or multivariate tests performed on the set of the regions of interest sampled in the acquired images. The use of such large numbers of tests requires application of techniques for correction for multiple comparisons. Standard multiple comparison adjustments (such as the Bonferroni) may be overly conservative when data are correlated and/or not normally distributed. Resampling-based step-down procedures that successfully account for unknown correlation structures in the data have recently been introduced. We combined resampling step-down procedures with the Minimum Variance Adaptive method, which allows selection of an optimal test statistic from a predefined class of statistics for the data under analysis. As shown in simulation studies and analysis of autoradiographic data, the combined technique exhibits a significant increase in statistical power, even for small sample sizes (n = 8, 9, 10).

The synthetic DOX-LNA conjugate was characterized by proton nuclear magnetic resonance and mass spectrometry. In addition, the purity of the conjugate was analyzed by reverse-phase high-performance liquid chromatography. The cellular uptake, intracellular distribution, and cytotoxicity of DOX-LNA were assessed by flow cytometry, fluorescence microscopy, liquid chromatography/electrospray ionization tandem mass spectrometry, and the tetrazolium dye assay using the in vitro cell models. The DOX-LNA conjugate showed substantially higher tumor-specific cytotoxicity compared with DOX.

Our objective was to compare three different methods of blood pressure measurement through the results of a controlled study aimed at comparing the antihypertensive effects of trandolapril and losartan. Two hundred and twenty-nine hypertensive patients were randomized in a double-blind parallel group study. After a 3-week placebo period, they received either 2 mg trandolapril or 50 mg losartan once daily for 6 weeks. At the end of both placebo and active treatment periods, three methods of blood pressure measurement were used: a) office blood pressure (three consecutive measurements); b) home self blood pressure measurements (SBPM), consisting of three consecutive measurements performed at home in the morning and in the evening for 7 consecutive days; and c) ambulatory blood pressure measurements (ABPM), 24-h BP recordings with three measurements per hour. Of the 229 patients, 199 (87%) performed at least 12 valid SBPM measurements during both placebo and treatment periods, whereas only 160 (70%) performed good quality 24-h ABPM recordings during both periods (P < .0001). One hundred-forty patients performed the three methods of measurement well. At baseline and with treatment, agreement between office measurements and ABPM or SBPM was weak. Conversely, there was a good agreement between ABPM and SBPM. The mean difference (SBP/DBP) between ABPM and SBPM was 4.6 +/- 10.4/3.5 +/- 7.1 at baseline and 3.5 +/- 10.0/4.0 +/- 7.0 at the end of the treatment period. The correlation between SBPM and ABPM expressed by the r coefficient and the P values were the following: at baseline 0.79/0.70 (< 0.001/< .0001), with active treatment 0.74/0.69 (0.0001/.0001). Hourly and 24-h reproducibility of blood pressure response was quantified by the standard deviation of BP response. Compared with office blood pressure, both global and hourly SBPM responses exhibited a lower standard deviation. Hourly reproducibility of SBPM response (10.8 mm Hg/6.9 mm Hg) was lower than hourly reproducibility of ABPM response (15.6 mm Hg/11.9 mm Hg). In conclusion, SBPM was easier to perform than ABPM. There was a good agreement between these two methods whereas concordance between SBPM or ABPM and office measurements was weak. As hourly reproducibility of SBPM response is better than reproducibility of both hourly ABPM and office BP response, SBPM seems to be the most appropriate method for evaluating residual antihypertensive effect.

We conducted a phase II study to assess the efficacy and tolerability of irinotecan and cisplatin as salvage chemotherapy in patients with advanced gastric adenocarcinoma, progressing after both 5-fluorouracil (5-FU)- and taxane-containing regimen. Patients with measurable metastatic gastric cancer, progressive after previous chemotherapy that consisted either of a 5-FU-based regimen followed by second-line chemotherapy containing taxanes or a 5-FU and taxane combination were treated with irinotecan and cisplatin. Irinotecan 70 mg/m(2) was administered on day 1 and day 15; cisplatin 70 mg/m(2) was administered on day 1. Treatment was repeated every 4 weeks. For 28 patients registered, a total of 94 chemotherapy cycles were administered. The patients' median age was 51 years and 27 (96%) had an ECOG performance status of 1 or below. In an intent-to-treat analysis, seven patients (25%) achieved a partial response, which maintained for 6.3 months (95% confidence interval 6.2-6.4 months). The median progression-free and overall survival were 3.5 and 5.6 months, respectively. Major toxic effects included nausea, diarrhea and neurotoxicity. Although there was one possible treatment-related death, toxicity profiles were generally predictable and manageable. We conclude that irinotecan and cisplatin is an active combination for patients with metastatic gastric cancer in whom previous chemotherapy with 5-FU and taxanes has failed.

Monomeric sarcosine oxidase (MSOX) is a flavoenzyme that catalyzes the oxidative demethylation of sarcosine (N-methylglycine) to yield glycine

We presented the tachinid fly Exorista japonica with moving host models: a freeze-dried larva of the common armyworm Mythimna separata, a black rubber tube, and a black rubber sheet, to examine the effects of size, curvature, and velocity on visual recognition of the host. The host models were moved around the fly on a metal arm driven by motor. The size of the larva, the velocity of movement, and the length and diameter of the rubber tube were varied. During the presentation of the host model, fixation, approach, and examination behaviours of the flies were recorded. The fly fixated on, approached, and examined the black rubber tube as well as the freeze-dried larva. Furthermore, the fly detected the black rubber tube at a greater distance than the larva. The rubber tube elicited higher rates of approach and examination responses than the rubber sheet, suggesting that curvature affects the responses of the flies. The length, diameter, and velocity of host models had little effect on response rates of the flies. During host pursuit, the fly appeared to walk towards the ends of the tube. These results suggest that the flies respond to the leading or trailing edges of a moving object and ignore the length and diameter of the object.

The literature dealing with the water conducting properties of sapwood xylem in trees is inconsistent in terminology, symbols and units. This has resulted from confusion in the use of either an analogy to Ohm's law or Darcy's law as the basis for nomenclature. Ohm's law describes movement of electricity through a conductor, whereas Darcy's law describes movement of a fluid (liquid or gas) through a porous medium. However, it is generally not realized that, in their full notation, these laws are mathematically equivalent. Despite this, plant physiologists have failed to agree on a convention for nomenclature. As a result, the study of water movement through sapwood xylem is confusing, especially for scientists entering the field. To improve clarity, we suggest the adoption of a single nomenclature that can be used by all plant physiologists when describing water movement in xylem. Darcy's law is an explicit hydraulic relationship and the basis for established theories that describe three-dimensional saturated and unsaturated flow in porous media. We suggest, therefore, that Darcy's law is the more appropriate theoretical framework on which to base nomenclature describing sapwood hydraulics. Our proposed nomenclature is summarized in a table that describes conventional terms, with their formulae, dimensions, units and symbols; the table also lists the many synonyms found in recent literature that describe the same concepts. Adoption of this proposal will require some changes in the use of terminology, but a common rigorous nomenclature is needed for efficient and clear communication among scientists.

A novel approach to synthesize chitosan-O-isopropyl-5'-O-d4T monophosphate conjugate was developed. Chitosan-d4T monophosphate prodrug with a phosphoramidate linkage was efficiently synthesized through Atherton-Todd reaction. In vitro drug release studies in pH 1.1 and 7.4 indicated that chitosan-O-isopropyl-5'-O-d4T monophosphate conjugate prefers to release the d4T 5'-(O-isopropyl)monophosphate than free d4T for a prolonged period. The results suggested that chitosan-O-isopropyl-5'-O-d4T monophosphate conjugate may be used as a sustained polymeric prodrug for improving therapy efficacy and reducing side effects in antiretroviral treatment.

An HPLC-ESI-MS-MS method has been developed for the quantitative determination of four diterpenoids, dihydrotanshinone I, cryptotanshinone, tanshinone I, and tanshinone IIA in Radix Salviae Miltiorrhizae (RSM, the root of Salvia miltiorrhiza BGE.). The diterpenoids were chromatographically separated on a C18 HPLC column, and the quantification of these diterpenoids was based on the fragments of [M+H]+ under collision-activated conditions and in Selected Reaction Monitoring (SRM) mode. The quantitative method was validated, and the mean recovery rates from fortified samples (n=5) of dihydrotanshinone I, cryptotanshinone, tanshinone I, and tanshinone IIA were 95.0%, 97.2%, 93.1%, and 95.9% with variation coefficient of 6.0%, 4.3%, 3.7%, and 4.2%, respectively. The established method was successfully applied to the quality assessment of seven batches of RSM samples collected from different regions of China.

The localizing and lateralizing values of eye and head ictal deviations during frontal lobe seizures are still matters of debate. In particular, no specific data regarding the origin of ipsilateral head turning in frontal lobe seizures are available. We report a patient with frontal lobe seizures associated with reproducible, early, ipsilateral head deviation, where imaging and video-stereo-electroencephalography data, as well as surgical outcome, demonstrated the fronto-polar and orbito-frontal origin of the epileptic discharge. We conclude that early ipsilateral head deviation, in the context of frontal lobe epilepsy, raises the possibility of fronto-polar or orbito-frontal seizure onset. [Published with video sequences].

OBJECTIVE: To evaluate the effectiveness and acceptability of expectant management of induced and spontaneous first trimester incomplete abortion. METHODS: A prospective observational trial, conducted between June 2006 and November 2007, of 2 groups of patients diagnosed with an incomplete abortion: 66 patients who had received misoprostol for an induced abortion (group 1) and 30 patients who had had a spontaneous abortion (group 2). Transvaginal ultrasound was performed weekly. The success rate (complete abortion without surgery), time to resolution, duration of bleeding and pelvic pain, rate of infection, number of unscheduled hospital visits, and level of satisfaction with expectant management were recorded. RESULTS: The incidence of complete abortion was 86.4% and 82.1% in groups 1 and 2 respectively at day 14 after diagnosis, and 100% in both groups at day 30 (two group 2 patients underwent curettage and were excluded from the analysis). Both groups reported 100% satisfaction with expectant management, although over 90% of the women reported feeling anxious. CONCLUSION: Expectant management for incomplete abortion in the first trimester after use of misoprostol or after spontaneous abortion may be practical and feasible, although it may increase anxiety associated with the impending abortion.

For the construction of new combinatorial libraries, a lead compound was created by replacing the core structure of a hit compound discovered by screening for cytotoxic agents against a tumorigenic cell line. The newly designed compound maintained biological activity and allowed alternative library construction for antitumor drugs.

We report the results of a screen for genetic association with urinary arsenic metabolite levels in three arsenic metabolism candidate genes, PNP, GSTO, and CYT19, in 135 arsenic-exposed subjects from the Yaqui Valley in Sonora, Mexico, who were exposed to drinking water concentrations ranging from 5.5 to 43.3 ppb. We chose 23 polymorphic sites to test in the arsenic-exposed population. Initial phenotypes evaluated included the ratio of urinary inorganic arsenic(III) to inorganic arsenic(V) and the ratio of urinary dimethylarsenic(V) to monomethylarsenic(V) (D:M). In the initial association screening, three polymorphic sites in the CYT19 gene were significantly associated with D:M ratios in the total population. Subsequent analysis of this association revealed that the association signal for the entire population was actually caused by an extremely strong association in only the children (7-11 years of age) between CYT19 genotype and D:M levels. With children removed from the analysis, no significant genetic association was observed in adults (18-79 years). The existence of a strong, developmentally regulated genetic association between CYT19 and arsenic metabolism carries import for both arsenic pharmacogenetics and arsenic toxicology, as well as for public health and governmental regulatory officials.

Intraparenchymal pericatheter cyst is rarely reported. Obstruction in the ventriculoperitoneal shunt leads to recurrence of hydrocephalus, signs of raised intracranial pressure and possibly secondary complications. Blockage of the distal catheter can result, unusually, in cerebrospinal fluid oedema and/or intraparenchymal cyst around the ventricular catheter which may produce focal neurological deficit. We report two cases of distal catheter obstruction with formation of cysts causing local mass effect and neurological deficit. Both patients had their shunt system replaced, which led to resolution of the cyst and clinical improvement. One patient had endoscopic exploration of the cyst which confirmed the diagnosis made on imaging studies. Magnetic resonance imaging was more helpful than computed tomography in differentiating between oedema and collection of cystic fluid. Early recognition and treatment of pericatheter cyst in the presence of distal shunt obstruction can lead to complete resolution of symptoms and signs.

It is known that patients with Klinefelter's syndrome are inclined to develop concomitant malignant tumours, as well as extragonadal germ cell tumours. The association of a primary spinal germinoma in a patient with Klinefelter's syndrome is reported for the first time, and the coincidence of elevated gonadotropin levels and oncogenesis is discussed.

Conclusion

This demo showed how to extract keywords from texts using the YakeKeywordExtraction annotator in Spark NLP. It covered setting up the environment, creating a pipeline, and running the keyword extraction, and then showed how to highlight the extracted keywords in the text.

For additional information, please check the following references.

Documentation: YakeKeywordExtraction
Python Docs: YakeKeywordExtraction
Scala Docs: YakeKeywordExtraction
For extended examples of usage, see the Spark NLP Workshop repository.
Reference Paper: YAKE! Keyword extraction from single documents using multiple local features

Community & Support

Official Website: Documentation and examples
Slack: Live discussion with the community and team
GitHub: Bug reports, feature requests, and contributions
Medium: Spark NLP articles
YouTube: Video tutorials