---
license: apache-2.0
datasets:
- alxxtexxr/indowebgen-dataset
language:
- id
---

# 🇮🇩🌐🤖 IndoWebGen: LLM for Automated (Bootstrap-Based) Website Generation Based on Indonesian Instructions

Hugely inspired by [Web App Factory](https://huggingface.co/spaces/jbilcke-hf/webapp-factory-wizardcoder).

## Model Description:
- Base Model: [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) [[1](https://arxiv.org/abs/2308.12950)]
- Finetuning Method: LoRA [[2](https://arxiv.org/abs/2106.09685)]
- Dataset: [alxxtexxr/indowebgen-dataset](https://huggingface.co/datasets/alxxtexxr/indowebgen-dataset)

## Finetuning Hyperparameters:
- Number of Epochs: 20
- Microbatch Size: 4
- Gradient Accumulation Steps: 8
- LoRA Rank: 16
- LoRA Alpha: 32
- LoRA Target Modules: [q_proj, v_proj]

## Inference:
Try the inference demo [here](https://indowebgen.alimtegar.my.id), or run the inference code in the provided Google Colab notebook [here](https://colab.research.google.com/drive/1pqqLGcgRcUTBLCNeF0V6REi7INJ43IZb?usp=sharing). The inference code is shown below:
```python
# Install the required libraries
!pip install transformers bitsandbytes accelerate

# Import the necessary modules
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and the tokenizer
model_id = 'alxxtexxr/indowebgen-7b'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    # load_in_4bit=True, # for low memory
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize the prompt. The template is in Indonesian; it reads:
# "The following is a website-creation instruction along with its output,
# which is the HTML code of the created website:"
prompt_template = '''Berikut adalah instruksi pembuatan website beserta output-nya yang berupa kode HTML dari website yang dibuat:

### Instruksi:
{instruction}

### Output:
'''
# INSERT YOUR OWN INDONESIAN INSTRUCTION BELOW
# (the example instruction means "Create a portfolio website for Budi")
instruction = 'Buatlah website portfolio untuk Budi'
prompt = prompt_template.format(instruction=instruction)

# Generate the output
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=2400,
    do_sample=True,
    temperature=1.0,
    top_k=3,
    top_p=0.8,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.unk_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

## Limitations
- The training dataset is limited to only 500 examples, so the model's performance may still be suboptimal.
- The model is designed to generate single-page static websites, built with HTML and internal CSS.
- The content of the generated websites is placeholder content (including the images), so users need to customize the websites further; see the sketch after this list for saving the output to a file.
- The generated websites use Bootstrap for styling, Font Awesome for icons, and dummyimage.com for placeholder images.
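
To make customization easier, the generated HTML can be extracted from the decoded output and written to a file. Below is a minimal sketch, assuming `outputs` and `tokenizer` come from the inference code above and that the decoded text follows the prompt template, so everything after the `### Output:` marker is the generated page; the `index.html` filename is only illustrative:

```python
# Minimal sketch: extract the generated HTML and save it to a file.
# Assumes `outputs` and `tokenizer` come from the inference code above;
# the 'index.html' filename is only an example.
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# The decoded text contains the prompt followed by the generated HTML,
# so keep only the part after the '### Output:' marker.
html = decoded.split('### Output:', 1)[-1].strip()

with open('index.html', 'w', encoding='utf-8') as f:
    f.write(html)
```

The saved file can be opened directly in a browser, and the dummy text and dummyimage.com placeholders can then be replaced with real content.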