Abstract
We present cdx1 and cdx1-pro, a family of language models designed to emulate the expertise of a professional in DevOps, xBOM (Bill of Materials), and the CycloneDX specification. The base models, unsloth/Qwen2.5-Coder-14B-Instruct (for cdx1) and unsloth/Qwen3-Coder-30B-A3B-Instruct (for cdx1-pro), were fine-tuned on a specialized, high-quality dataset. This dataset was constructed using a synthetic data generation strategy with a teacher model (Gemini 2.5 Pro). The primary objective was to align the fine-tuned models' capabilities with the teacher model's performance on xBOM and CycloneDX-related question-answering tasks.
Approach to Data
Data Curation and Generation
The models were trained on cdx-docs, a curated dataset comprising technical documentation, authoritative OWASP guides, and semantic interpretations derived from the CycloneDX Generator (cdxgen) source code. The dataset was augmented using a synthetic data generation technique. This process involved prompting a teacher model (Gemini 2.5 Pro) to generate question-answer pairs that encapsulate the nuances and semantics of the domain. The generated data was structured to facilitate effective learning by the target cdx1 models.
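As a concrete illustration of this process, the sketch below shows one way a teacher-driven question-answer generation loop can be structured. The `ask_teacher` helper, prompt wording, and JSONL layout are assumptions made for illustration only; they are not the exact pipeline used to build cdx-docs.

```python
import json

def ask_teacher(prompt: str) -> str:
    """Hypothetical wrapper around the teacher model (e.g., Gemini 2.5 Pro).
    Substitute whichever client library or API gateway you actually use."""
    raise NotImplementedError

QA_PROMPT = """You are an expert in CycloneDX, xBOM, and DevOps.
From the documentation excerpt below, write five question-answer pairs that
capture its key facts and semantics. Respond as a JSON list of objects with
"question" and "answer" fields.

Excerpt:
{chunk}
"""

def generate_pairs(doc_chunks, out_path="synthetic-qa.jsonl"):
    """Prompt the teacher model chunk by chunk and persist Q&A pairs as JSONL."""
    with open(out_path, "w", encoding="utf-8") as out:
        for chunk in doc_chunks:
            raw = ask_teacher(QA_PROMPT.format(chunk=chunk))
            for pair in json.loads(raw):
                out.write(json.dumps(pair, ensure_ascii=False) + "\n")
```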
Alignment with Inference
During the training phase, the dataset was iteratively refined to ensure the format and context of the training examples closely resembled the intended inference-time inputs. This alignment is critical for the models to learn the domain's complexity and respond accurately to real-world prompts.
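A minimal sketch of this alignment is shown below, assuming a standard chat-message layout; the field names and system prompt are illustrative rather than the exact cdx-docs schema.

```python
# Training records use the same chat layout the model will see at inference time.
training_example = {
    "messages": [
        {"role": "system", "content": "You are an expert in CycloneDX, xBOM, and DevOps."},
        {"role": "user", "content": "Which CycloneDX field records the package URL of a component?"},
        {"role": "assistant", "content": "The component's purl field stores its package URL (PURL)."},
    ]
}

# At inference, the identical system/user framing is reused so the prompt
# distribution matches what the model was fine-tuned on.
inference_messages = training_example["messages"][:2]  # system + user; answer omitted
```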
Benchmarking
The cdx1 models are optimized for xBOM use cases, including BOM summarization, component tagging, validation, and troubleshooting. To evaluate model performance, we developed a custom benchmark suite named xBOMEval.
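As an example of the kind of task targeted, a validation-style prompt might pair a small CycloneDX document with a question, as sketched below; the BOM content and wording are illustrative and are not items from xBOMEval.

```python
import json

# A minimal CycloneDX 1.6 BOM used as context for a validation-style question.
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "version": 1,
    "components": [
        {
            "type": "library",
            "name": "lodash",
            "version": "4.17.21",
            "purl": "pkg:npm/lodash@4.17.21",
        }
    ],
}

prompt = (
    "Review the following CycloneDX BOM and list any missing fields that would "
    "improve its completeness for vulnerability analysis:\n"
    + json.dumps(bom, indent=2)
)
```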
Categories
xBOMEval contains tests across the following categories:
- Bias: Assesses potential model bias towards CycloneDX or SPDX specifications through targeted questions.
- Specification (Spec): Measures factual recall and synthesis on topics such as CycloneDX, PURL, and SPDX.
- Logic: Evaluates problem-solving and reasoning capabilities with complex questions about specifications.
- DevOps: Assesses knowledge of platforms and tools like GitHub, Azure Pipelines, and package managers.
- Linux: Tests proficiency with Linux environments, including terminal and PowerShell commands.
- Docker: Measures understanding of Docker, Podman, and the OCI specification.
Scoring
Model responses were scored using a combination of automated evaluation by a high-capability model (Gemini 2.5 Pro) and manual human review. To maintain benchmark integrity, the evaluation set was held out and not included in any model's training data. Detailed results and configurations are available in the xBOMEval directory of the cdxgen repository.
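The sketch below shows the general shape of such a judge-plus-human loop; the `ask_judge` helper, prompt wording, and output schema are assumptions for illustration and not the exact harness shipped with cdxgen.

```python
import json

def ask_judge(prompt: str) -> str:
    """Hypothetical call to the evaluating model (Gemini 2.5 Pro in our setup)."""
    raise NotImplementedError

JUDGE_PROMPT = """Reference answer:
{reference}

Candidate answer:
{candidate}

Does the candidate answer convey the same facts as the reference?
Reply with a JSON object: {{"correct": true or false, "reason": "..."}}"""

def score(items):
    """Score {question, reference, candidate} dicts; return accuracy plus a list
    of judge-flagged items that are then routed to manual human review."""
    correct, flagged = 0, []
    for item in items:
        verdict = json.loads(ask_judge(JUDGE_PROMPT.format(**item)))
        if verdict["correct"]:
            correct += 1
        else:
            flagged.append({**item, "reason": verdict["reason"]})
    return correct / len(items), flagged
```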
Benchmark Results - August 2025
Logic Category Comparison
The Logic category tests reasoning and problem-solving skills. The table below compares the accuracy of nine models on these tasks.
Model | Accuracy (%) |
---|---|
cdx1-mlx-8bit | 46.04 |
cdx1-pro-mlx-8bit | 73.17 |
gemini-2.5-pro | 93.60 |
o4-mini-high | 67.99 |
qwen3-coder-480B | 48.48 |
deepthink-r1 | 89.63 |
deepseek-r1 | 82.92 |
gpt-oss-120b | 80.49 |
gpt-oss-20b | 79.27 |
Summary of Results:
- Top Performer: gemini-2.5-pro achieved the highest accuracy at 93.6%.
- High Performers: A group of models demonstrated strong reasoning, including deepthink-r1 (89.6%), deepseek-r1 (82.9%), and gpt-oss-120b (80.5%).
- Specialized Model Performance: cdx1-pro (30B parameters) performed competitively at 73.2%. The score for cdx1 (14B parameters) was 46.0%, a result attributed primarily to context length limitations rather than a fundamental deficiency in logic.
- Performance Tiers: The results indicate distinct performance tiers, with a significant gap between the top-performing models (>80%) and others.
Specification Category Comparison
The Spec category tests the recall of factual information from technical specifications.
Model | Accuracy (%) |
---|---|
cdx1-mlx-8bit | 83.52 |
cdx1-pro-mlx-8bit | 98.3 |
gemini-2.5-pro | 100 |
o4-mini-high | 0 |
qwen3-coder-480B | 90.34 |
deepthink-r1 | 12.36 |
deepseek-r1 | 98.58 |
gpt-oss-120b | 89.2 |
gpt-oss-20b | 9.09 |
Summary of Results:
- Near-Perfect Recall: gemini-2.5-pro (100%), deepseek-r1 (98.6%), and cdx1-pro (98.3%) demonstrated exceptional performance.
- Behavioral Failures: Three models scored poorly due to operational issues rather than a lack of knowledge. o4-mini-high (0%) refused to answer, while deepthink-r1 (12.4%) and gpt-oss-20b (9.1%) answered only a small fraction of questions.
- cdx1 Performance: The smaller cdx1 model scored 83.5%. Its performance was negatively affected by a systematic misunderstanding of certain technical terms, highlighting the challenge of ensuring factual accuracy in highly specialized domains.
Other Categories
Performance in additional technical categories is summarized below.
Category | cdx1-mlx-8bit | cdx1-pro-mlx-8bit |
---|---|---|
DevOps | 87.46% | 96.1% |
Docker | 89.08% | 100% |
Linux | 90.6% | 95.8% |
Model Availability
The cdx1 and cdx1-pro models are provided in multiple formats and quantization levels to facilitate deployment across diverse hardware environments. Models are available in the MLX format, optimized for local inference on Apple Silicon, and the GGUF format, which offers broad compatibility with CPUs and various GPUs. The selection of quantization levels allows users to balance performance with resource consumption, enabling effective operation even in environments with limited VRAM.
The table below details the available formats and their approximate resource requirements. All quantized models can be found on Hugging Face.
| Model | Format | Quantization | File Size (GiB) | Est. VRAM (GiB) | Notes |
|---|---|---|---|---|---|
| cdx1 (14B) | MLX | 4-bit | ~8.1 | > 8 | For Apple Silicon with unified memory. |
| | MLX | 6-bit | ~12 | > 12 | For Apple Silicon with unified memory. |
| | MLX | 8-bit | ~14.2 | > 14 | Higher fidelity for Apple Silicon. |
| | MLX | 16-bit | ~30 | > 30 | bfloat16 for fine-tuning. |
| | GGUF | Q4_K_M | 8.99 | ~10.5 | Recommended balance for quality/size. |
| | GGUF | Q8_0 | 15.7 | ~16.5 | Near-lossless quality. |
| | GGUF | BF16 | 29.5 | ~30 | bfloat16 for fine-tuning. |
| cdx1-pro (30B) | MLX | 4-bit | ~17.5 | > 18 | For Apple Silicon with unified memory. |
| | MLX | 6-bit | ~24.8 | > 25 | For Apple Silicon with unified memory. |
| | MLX | 8-bit | ~32.4 | > 33 | Higher fidelity for Apple Silicon. |
| | MLX | 16-bit | ~57 | > 57 | bfloat16 for fine-tuning. |
| | GGUF | Q4_K_M | 18.6 | ~20.0 | Recommended balance for quality/size. |
| | GGUF | IQ4_NL | 17.6 | ~20.0 | Recommended balance for quality/size. |
| | GGUF | Q8_0 | 32.5 | ~33 | Near-lossless quality. |
| | GGUF | Q2_K | 11.3 | ~12 | Low quality. Use for speculative decoding. |
| | GGUF | BF16 | 57 | ~60 | bfloat16 for fine-tuning. |
Notes on Quantization and Formats:
- IQ4_NL (Importance-aware Quantization, Non-Linear): A sophisticated 4-bit method that preserves important model weights with higher precision. It often provides superior performance compared to standard 4-bit quants at a similar file size and is a strong alternative to Q4_K_M.
- K-Quants (Q2_K, Q4_K_M): This family of quantization methods generally offers a better quality-to-size ratio than older _0 or _1 variants.
- Q2_K: An extremely small 2-bit quantization designed for environments with severe resource limitations. Users should anticipate a noticeable reduction in model accuracy and coherence in exchange for the minimal VRAM and storage footprint.
- Q8_0: A full 8-bit quantization that provides high fidelity at the cost of a larger file size. It is suitable for systems with ample VRAM.
- VRAM Requirements: The values provided are estimates for loading the model and processing a moderate context. Actual VRAM consumption can vary based on factors such as context length, batch size, and the specific inference software used.
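As a usage sketch, the GGUF builds can be loaded with llama-cpp-python as shown below; the repository id and filename pattern are illustrative assumptions, so check the Hugging Face listings for the exact names. The MLX builds can be loaded analogously on Apple Silicon with the mlx-lm package's `load` and `generate` helpers.

```python
from llama_cpp import Llama

# Download and load a quantized GGUF build from Hugging Face.
# Repo id and filename pattern are illustrative; pick the quant that fits your VRAM.
llm = Llama.from_pretrained(
    repo_id="CycloneDX/cdx1-pro-gguf",
    filename="*Q4_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers when a GPU is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the main sections of a CycloneDX SBOM."}]
)
print(out["choices"][0]["message"]["content"])
```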
Safety and Bias
(To be determined)
Weaknesses
(To be determined)
Acknowledgments
(To be determined)
Citation
Please cite the following resources if you use the datasets, models, or benchmark in your work.
For the Dataset
@misc{cdx-docs,
author = {OWASP CycloneDX Generator Team},
title = {{cdx-docs: A Curated Dataset for SBOM and DevOps Tasks}},
year = {2025},
month = {February},
howpublished = {\url{https://huggingface.co/datasets/CycloneDX/cdx-docs}}
}
For the Models
@misc{cdx1_models,
author = {OWASP CycloneDX Generator Team},
title = {{cdx1 and cdx1-pro: Language Models for SBOM and DevOps}},
year = {2025},
month = {February},
howpublished = {\url{https://huggingface.co/CycloneDX}}
}
For the xBOMEval Benchmark
@misc{xBOMEval_v1,
author = {OWASP CycloneDX Generator Team},
title = {{xBOMEval: A Benchmark for Evaluating Language Models on SBOM Tasks}},
year = {2025},
month = {August},
howpublished = {\url{https://github.com/CycloneDX/cdxgen}}
}
Licenses
- Datasets: CC0-1.0
- Models: Apache-2.0