# Navigating Scientific Papers in 2D Scatter Plots
A simple way to get a glimpse of how scientific papers are related to one another is to plot their projections on a 2D plane, similar to https://huggingface.co/spaces/gwf-uwaterloo/acl-spectrum.

This notebook provides steps to visualize papers from the [ACL Anthology](https://aclanthology.org/). We first embed the papers into dense representations using a model ([specter2](https://huggingface.co/allenai/specter2_base) by default). After clustering the embeddings, we apply t-SNE to project them into 2 dimensions for visualization.

**Before running this Colab notebook, make sure the runtime type is set to GPU.** We check the availability of GPUs in the "Checks" section.

The plot will be generated using [plotly](https://plotly.com/python/getting-started/).

In [None]:
# @title XML file name to download from acl-anthology github page
FILE_NAME = '2023.acl.xml' # @param {type:"string"}

In [None]:
# @title Model name from huggingface
MODEL_NAME = 'allenai/specter2_base' # @param {type:"string"}

ADAPTER_NAME = "" # @param {type:"string"}

In [None]:
# @title Inference args
BATCH_SIZE = 64 # @param {type:"integer"}

In [None]:
# @title Visualization args
NUM_CLUSTERS = 50 # @param {type:"integer"}

## Setup

### Install dependencies

In [None]:
!pip install datasets
!pip install transformers
!pip install adapter-transformers==3.0.1


### Imports

In [None]:
import json
import os
import re
from functools import partial
from tqdm.auto import tqdm
from typing import Any, Iterable, Mapping

import datasets
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding, AutoModel, AutoTokenizer, AutoConfig
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

import plotly.express as px

### Checks

In [None]:
#@markdown **Check GPU type**
!nvidia-smi -L

#@markdown **Check PyTorch version**
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
print("#GPUs:", torch.cuda.device_count())

GPU 0: Tesla T4 (UUID: GPU-5e2802f0-3a72-ee6b-56ce-fc17d7e725c4)
PyTorch version: 2.0.1+cu118
CUDA version: 11.8
#GPUs: 1


### Load the Hugging Face Tokenizer and Model

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "true"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

config = AutoConfig.from_pretrained(MODEL_NAME, return_dict=True, output_hidden_states=True)

model = AutoModel.from_pretrained(MODEL_NAME, config=config)
if ADAPTER_NAME:
    model.load_adapter(
      ADAPTER_NAME,
      source="hf",
      set_active=True,
    )

model.eval()
model.to("cuda")

BertModel(
  (shared_parameters): ModuleDict()
  (invertible_adapters): ModuleDict()
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31090, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict()
              )
            )
          )
      

## Preparing Data

### Downloading from acl-anthology GitHub

The paper information can be downloaded in XML format from the `acl-anthology` GitHub repository: https://github.com/acl-org/acl-anthology/tree/master/data/xml/
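
As a rough picture of what such a file contains, the sketch below parses a simplified, hypothetical XML snippet modeled on the fields this notebook reads (`title`, `author`, `abstract`, `url`, `bibkey`). Real Anthology files contain more elements; the sample paper is invented for illustration. It also shows why the parser joins `itertext()` rather than reading `.text`: titles may contain nested markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snippet; real Anthology files contain more elements.
SAMPLE_XML = """
<collection id="2023.acl">
  <volume id="long">
    <paper id="1">
      <title>A Study of <fixed-case>BERT</fixed-case> Embeddings</title>
      <author><first>Jane</first><last>Doe</last></author>
      <abstract>We study embeddings.</abstract>
      <url>2023.acl-long.1</url>
      <bibkey>doe-2023-study</bibkey>
    </paper>
  </volume>
</collection>
""".strip()

root = ET.fromstring(SAMPLE_XML)
paper = root.find(".//paper")
title_el = paper.find("title")

# .text stops at the first child element, so nested markup truncates the title.
text_only = title_el.text
# itertext() walks every nested text node, recovering the full title.
full_title = "".join(title_el.itertext())

print(repr(text_only))   # 'A Study of '
print(repr(full_title))  # 'A Study of BERT Embeddings'
```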

In [None]:
!rm -f $FILE_NAME
!wget "https://raw.githubusercontent.com/acl-org/acl-anthology/master/data/xml/$FILE_NAME"

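# The message is shown only when the assertion fails, so it should describe the failure.
assert os.path.exists(FILE_NAME), f"Download failed: {FILE_NAME} not found"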

--2023-09-20 03:28:48--  https://raw.githubusercontent.com/acl-org/acl-anthology/master/data/xml/2023.acl.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2597735 (2.5M) [text/plain]
Saving to: ‘2023.acl.xml’


2023-09-20 03:28:49 (142 MB/s) - ‘2023.acl.xml’ saved [2597735/2597735]



Alternatively, download the XML file manually from this [link](https://github.com/acl-org/acl-anthology/tree/006c7247a6bf0ff859bfd3aab6ea6a19452580ad/data/xml).  
The code below converts the XML file into JSONL-style records, one dictionary per paper.

### Parsing

In [None]:
import xml.etree.ElementTree as ET

URL_MAPPINGS = dict(
    D="emnlp",
    N="naacl",
    P="acl",
    Q="tacl",
)

def xml_to_jsonl(xml_file: os.PathLike) -> Iterable[Mapping[str, Any]]:
  tree = ET.parse(xml_file)
  root = tree.getroot()
  papers = root.findall(".//paper")

  for paper in papers:
    paper_dict = {}
    paper_dict["title"] = "".join(paper.find("title").itertext())

    authors = []
    for author in paper.findall("author"):
      first_name = author.findtext("first")
      last_name = author.findtext("last")
      authors.append(f"{first_name} {last_name}")
    paper_dict["authors"] = authors

    # Some papers have no abstract; compare against None with `is`, not `==`.
    abstract = paper.find("abstract")
    paper_dict["abstract"] = "" if abstract is None else "".join(abstract.itertext())
    paper_dict["pages"] = paper.findtext("pages")
    paper_dict["url"] = paper.findtext("url")
    paper_dict["bibkey"] = paper.findtext("bibkey")
    paper_dict["doi"] = paper.findtext("doi")

    # Initialize `year` as well, so it is defined even if no year can be extracted below.
    conference, paper_type, year = None, None, None
    matched = re.match(r"(\d+)\.(\w+)-(\w+)\.\d+", paper_dict["url"])
    if matched:
      year = int(matched.group(1))
      conference = matched.group(2)
      paper_type = matched.group(3)
    else:
      bibs = paper_dict["bibkey"].split("-")
      for b in range(len(bibs) - 1, -1, -1):
        try:
          year = int(bibs[b])
          break
        except ValueError:
          pass

      conference = URL_MAPPINGS.get(paper_dict["url"][0], None)

    paper_dict["source"] = conference
    paper_dict["year"] = year
    paper_dict["publication_type"] = paper_type

    yield paper_dict

papers = list(xml_to_jsonl(FILE_NAME))

print(f"#papers found in {FILE_NAME}: {len(papers)}")

#papers found in 2023.acl.xml: 1249
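
To illustrate how the parser above derives `year`, `source`, and `publication_type`, the sketch below applies the same regular expression and `URL_MAPPINGS` fallback to two sample IDs. The helper `describe` and the sample inputs are assumptions made for illustration; it uses `str.isdigit` instead of the `try`/`except` loop, which is a simplification rather than the notebook's exact logic.

```python
import re

# Same pattern and venue mapping as the parser in this notebook.
NEW_STYLE_ID = re.compile(r"(\d+)\.(\w+)-(\w+)\.\d+")
URL_MAPPINGS = dict(D="emnlp", N="naacl", P="acl", Q="tacl")


def describe(url, bibkey):
    """Return (year, source, publication_type) for one paper (hypothetical helper)."""
    matched = NEW_STYLE_ID.match(url)
    if matched:
        # New-style IDs encode year, venue, and volume type directly.
        return int(matched.group(1)), matched.group(2), matched.group(3)
    # Old-style IDs: take the last numeric piece of the bibkey as the year
    # and map the first letter of the ID to a venue.
    year = next((int(part) for part in reversed(bibkey.split("-")) if part.isdigit()), None)
    return year, URL_MAPPINGS.get(url[0]), None


new_style = describe("2023.acl-long.1", "doe-2023-study")
old_style = describe("N19-1423", "devlin-etal-2019-bert")

print(new_style)  # (2023, 'acl', 'long')
print(old_style)  # (2019, 'naacl', None)
```

Old-style IDs carry no volume type, which is why `publication_type` stays `None` for them.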


## Encode

### Creating DataLoader

In [None]:
dataset = datasets.Dataset.from_list(
  [{"text": p["title"] + tokenizer.sep_token + (p["abstract"] or ""), "idx": i + 1} for i, p in enumerate(papers)]
)

tokenize_fn = lambda batch: tokenizer(batch["text"], padding=True, truncation=True, max_length=512)
dataset = dataset.map(tokenize_fn, batched=True)

columns = ["idx", "input_ids", "attention_mask"]
if "token_type_ids" in dataset.column_names:
  columns.append("token_type_ids")

data_loader = DataLoader(
  dataset.with_format("torch", columns=columns),
  collate_fn=DataCollatorWithPadding(tokenizer),
  batch_size=BATCH_SIZE,
)

Map:   0%|          | 0/1249 [00:00<?, ? examples/s]
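
The cell above builds each input as title, separator token, abstract, and relies on padding so that sequences in a batch share one length. The pure-Python sketch below mimics that idea on toy token IDs. `build_text` and `pad_batch` are hypothetical helpers, `"[SEP]"` and `pad_id=0` are assumed values, and this is not how `DataCollatorWithPadding` is actually implemented.

```python
SEP = "[SEP]"  # assumed separator; the notebook uses tokenizer.sep_token


def build_text(title, abstract):
    """Join title and abstract the way the dataset cell does (hypothetical helper)."""
    return title + SEP + (abstract or "")


def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks (hypothetical helper)."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


texts = [build_text("Short title", None), build_text("Another title", "With an abstract.")]

# Toy token ids standing in for real tokenizer output.
batch_ids = [[101, 7, 102], [101, 8, 9, 10, 102]]
input_ids, attention_mask = pad_batch(batch_ids)

print(texts[0])        # Short title[SEP]
print(input_ids)       # [[101, 7, 102, 0, 0], [101, 8, 9, 10, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```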

### Running Inference

In [None]:
embeds = []
for batch in tqdm(data_loader, desc="encoding"):
  indices = batch.pop("idx", None)
  if isinstance(indices, torch.Tensor):
    indices = indices.cpu().tolist()

  batch = {k: v.to("cuda") if v is not None else v for k, v in batch.items()}

  with torch.no_grad():
    output = model(**batch)
    encoded = output.last_hidden_state[:, 0].cpu().numpy()

  embeds.append(encoded)

embeds = np.concatenate(embeds, axis=0)

print("Embeddings size:", embeds.shape)

encoding:   0%|          | 0/20 [00:00<?, ?it/s]

Embeddings size: (1249, 768)
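
The shapes reported above can be reproduced without a model. The sketch below uses random arrays in place of `last_hidden_state` (an assumption made purely for illustration) to show how taking position 0 of every sequence and concatenating the batches yields one 768-dimensional vector per paper, and why 1249 papers at a batch size of 64 give 20 batches.

```python
import math

import numpy as np

num_papers, batch_size, hidden_size = 1249, 64, 768
num_batches = math.ceil(num_papers / batch_size)
last_batch_size = num_papers - (num_batches - 1) * batch_size

rng = np.random.default_rng(0)
pooled = []
for b in range(num_batches):
    rows = batch_size if b < num_batches - 1 else last_batch_size
    seq_len = 8  # toy sequence length; real batches are padded up to 512 tokens
    # Random stand-in for the model output of shape (batch, seq_len, hidden).
    last_hidden_state = rng.normal(size=(rows, seq_len, hidden_size))
    # Keep only the vector at position 0 (the first token) of each sequence.
    pooled.append(last_hidden_state[:, 0])

embeds = np.concatenate(pooled, axis=0)
print(num_batches, last_batch_size, embeds.shape)  # 20 33 (1249, 768)
```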


## Housekeeping prior to Visualization

To plot the embeddings, we first cluster the points and then reduce them to two dimensions using t-SNE.

### Clustering

In [None]:
clusterer = KMeans(n_clusters=NUM_CLUSTERS, n_init="auto")
clusters = clusterer.fit(embeds).labels_

print("Clustering done")

Clustering done
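
K-means starts from a random initialization, so cluster IDs (and therefore colors in the plot) can change between runs. As a minimal sketch on toy data, assuming reproducibility matters, passing `random_state` fixes the result. The two synthetic blobs below stand in for paper embeddings and are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs standing in for paper embeddings.
blob_a = rng.normal(loc=0.0, scale=0.1, size=(5, 4))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(5, 4))
points = np.vstack([blob_a, blob_b])

# random_state makes the assignment repeatable across runs.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points).labels_
print(labels)
```

Each blob ends up in its own cluster; which blob receives ID 0 is arbitrary, so the IDs should be read as group names rather than as an ordering.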


### Applying t-SNE

We set the perplexity and the number of iterations to values other than scikit-learn's defaults (30 and 1000, respectively) because the resulting scatter plot looks nicer.

In [None]:
reducer = TSNE(n_jobs=12, perplexity=10, n_iter=3000)
reduced_embeds = reducer.fit_transform(embeds)
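
As a small, hedged illustration of the reduction step, the sketch below runs t-SNE on random toy vectors (an assumption; they carry no real structure) and checks that each input row is mapped to a 2D point. The perplexity must be smaller than the number of samples, so it is lowered for this tiny dataset. The iteration count is left at its default here because recent scikit-learn releases renamed `n_iter` to `max_iter`.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 30 toy "embeddings" with 16 dimensions each.
toy_embeds = rng.normal(size=(30, 16))

# perplexity=5 is well below the 30 samples; random_state makes the layout repeatable.
reduced = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(toy_embeds)
print(reduced.shape)  # (30, 2)
```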

## Visualize

In [None]:
# @title
def to_string_authors(list_of_authors):
  if len(list_of_authors) > 5:
    return ", ".join(list_of_authors[:5]) + ", et al."
  elif len(list_of_authors) > 2:
    return ", ".join(list_of_authors[:-1]) + ", and " + list_of_authors[-1]
  else:
    return " and ".join(list_of_authors)


for point, c, p in zip(reduced_embeds, clusters, papers):
  p["x"] = point[0]
  p["y"] = point[1]
  p["cluster"] = c
  # Reorder "Last, First" names to "First Last", then join them into one string for the hover label.
  names = [(x[x.index(",") + 1 :].strip() + " " + x.split(",")[0].strip()) if "," in x else x for x in p["authors"]]
  p["authors_trimmed"] = to_string_authors(names)
  if "publication_type" in p:
    p["type"] = p.pop("publication_type")

df = pd.DataFrame(papers)

fig = px.scatter(
  df,
  x="x",
  y="y",
  color="cluster",
  width=1000,
  height=800,
  custom_data=("title", "authors_trimmed", "year", "source", "type"),
  color_continuous_scale="fall",
)
fig.update_traces(
  hovertemplate="<b>%{customdata[0]}</b><br>%{customdata[1]}<br>%{customdata[2]}<br><i>%{customdata[3]}</i>"
)
fig.update_layout(
  showlegend=False,
  font=dict(
    family="Times New Roman",
    size=30,
  ),
  hoverlabel=dict(
    align="left",
    font_size=14,
    font_family="Rockwell",
    namelength=-1,
  ),
)
fig.update_xaxes(title="")
fig.update_yaxes(title="")

fig.show()
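
To make the author formatting concrete, the sketch below restates the `to_string_authors` helper from the cell above so that it runs on its own, and applies it to invented author lists of different lengths. It also shows the "Last, First" reordering expression on one sample name. All names are placeholders.

```python
def to_string_authors(list_of_authors):
  # Same logic as the helper in the visualization cell.
  if len(list_of_authors) > 5:
    return ", ".join(list_of_authors[:5]) + ", et al."
  elif len(list_of_authors) > 2:
    return ", ".join(list_of_authors[:-1]) + ", and " + list_of_authors[-1]
  else:
    return " and ".join(list_of_authors)


one = to_string_authors(["A"])
two = to_string_authors(["A", "B"])
three = to_string_authors(["A", "B", "C"])
six = to_string_authors(["A", "B", "C", "D", "E", "F"])

# Reordering a "Last, First" name, as done for each author before plotting.
x = "Doe, Jane"
reordered = x[x.index(",") + 1 :].strip() + " " + x.split(",")[0].strip()

print(one)        # A
print(two)        # A and B
print(three)      # A, B, and C
print(six)        # A, B, C, D, E, et al.
print(reordered)  # Jane Doe
```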