I will do large language model projects
Machine Learning, Quantitative Finance, Data
About this Gig
I will train custom language models from scratch or fine-tune open-weight LLMs on your data. I build GPT-style transformer models from scratch in PyTorch, ranging from small 10M-parameter demos to 50M-parameter models. I also fine-tune existing models such as Llama, Phi-3, and Mistral on your dataset using LoRA/QLoRA.
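To give a feel for what "10M" and "50M" parameters mean in practice, here is a rough sizing sketch. It uses the common approximation of about 12 × d_model² weights per transformer block (attention plus a 4x-wide MLP) and ignores biases and layer norms. The function name and the two example configurations are illustrative assumptions, not the exact architectures delivered with an order.

```python
def estimate_gpt_params(vocab_size, d_model, n_layers, context_len, tie_embeddings=True):
    """Approximate parameter count of a GPT-style decoder (biases and layer norms ignored)."""
    token_emb = vocab_size * d_model          # token embedding table
    pos_emb = context_len * d_model           # learned positional embeddings
    per_block = 12 * d_model ** 2             # ~4*d^2 attention + ~8*d^2 MLP
    # With tied embeddings the output head reuses the token embedding table.
    out_head = 0 if tie_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * per_block + out_head


# Hypothetical configurations that land near the two sizes offered.
small = estimate_gpt_params(vocab_size=8_000, d_model=320, n_layers=6, context_len=256)
large = estimate_gpt_params(vocab_size=24_000, d_model=512, n_layers=12, context_len=512)

print(f"small: {small / 1e6:.1f}M parameters")
print(f"large: {large / 1e6:.1f}M parameters")
```

Vocabulary size matters a lot at this scale: in the smaller configuration, the embedding table alone accounts for roughly a quarter of all weights.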
What you get:
- Fully trained model weights and tokenizer tailored to your data
- Complete source code with comments for training and inference
- Text generation script + setup instructions
- Training logs, loss curves, and sample outputs
- Full commercial rights
I handle data preprocessing, tokenizer training, model architecture, and the training pipeline. You just provide your text dataset in .txt, .csv, .json, or PDF format. If you don't have one, I can use open-source data from Hugging Face, Kaggle, and other sources.
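As an illustration of the preprocessing step, here is a minimal sketch using only the Python standard library. It covers .txt and .csv input only (PDF and .json need extra handling that is left out here), and the function names, the cleaning rules, and the 10% validation split are assumptions made for this example rather than a fixed recipe.

```python
import csv
import re
from pathlib import Path


def load_text(path):
    """Read a .txt or .csv file and return its contents as one string."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".csv":
        # Assumption: every cell is text; cells are joined with spaces, rows with newlines.
        with path.open(newline="", encoding="utf-8") as f:
            return "\n".join(" ".join(row) for row in csv.reader(f))
    raise ValueError(f"unsupported format: {suffix}")


def clean_text(text):
    """Collapse runs of spaces/tabs, trim each line, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def train_val_split(text, val_fraction=0.1):
    """Hold out the final val_fraction of the characters for validation."""
    cut = int(len(text) * (1 - val_fraction))
    return text[:cut], text[cut:]


raw = "Hello   world\n\n\tSecond line \n"
cleaned = clean_text(raw)
train_text, val_text = train_val_split(cleaned, val_fraction=0.1)
```

Splitting by position rather than at random keeps the validation text contiguous, which is usually what you want when measuring loss on running text.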
Important: Models of this size (10M-50M parameters) are designed for demos, educational use, and learning the style of your specific data. They demonstrate how LLMs work but will not have broad knowledge like ChatGPT.
Expertise: Feature learning • Predictive analysis • Other
Frameworks: Scikit-learn • Keras • PyTorch • Pandas
Data type: Text
Programming language: Python • SQL • Colab • NoSQL
Tools: Jupyter Notebook • OpenCV • OpenNN • TensorFlow • Excel • Colab • Other
FAQ
What exactly do I receive?
You get: 1) Trained model weights (.safetensors) 2) Custom tokenizer 3) Full Python source code for training and inference 4) requirements.txt and setup guide 5) Training logs with loss/perplexity plots 6) Sample text generations 7) Full commercial rights.
Do you provide the training data?
If you have a custom dataset, you can provide it, and I will handle cleaning, formatting, tokenization, and training. Accepted formats: .txt, .csv, .json, or PDF. If you don't have one, I can, at your request, use open-source data from sites like Hugging Face, Kaggle, and others to train your model.
Will my 10M or 50M model be like ChatGPT?
No. Models under 100M parameters are for demos, proof-of-concepts, and learning specific styles and patterns from your data. They will generate text in the style of your domain but won't have broad knowledge, reasoning, or instruction-following like ChatGPT. For that, you need models with 7B+ parameters trained on massive datasets.
How much data do I need to provide?
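For 10M models: 10MB-100MB of text. For 50M models: 50MB-500MB of text. More data generally gives better results. As a rough guide, 1MB of text ≈ 200k tokens, though the exact ratio depends on the tokenizer and the language. If you're unsure, send me your dataset and I'll check if it's sufficient before we start.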
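If you want a quick self-check before sending anything, the guideline above can be turned into a few lines of Python. This is only a sketch of the rule of thumb stated here (200k tokens per MB and the two size ranges); the function names are made up for the example, and the real token count will vary with the tokenizer.

```python
TOKENS_PER_MB = 200_000  # rule of thumb from this FAQ, not an exact figure

# Recommended dataset sizes in MB (low, high) for each model size.
RECOMMENDED_MB = {"10M": (10, 100), "50M": (50, 500)}


def estimate_tokens(size_mb):
    """Rough token count for a plain-text dataset of the given size in MB."""
    return int(size_mb * TOKENS_PER_MB)


def check_dataset(size_mb, model_size):
    """Compare a dataset size against the recommended range for a model size."""
    low, high = RECOMMENDED_MB[model_size]
    if size_mb < low:
        return "below range"
    if size_mb > high:
        return "above range"
    return "within range"


size_mb = 25
print(f"{size_mb}MB is roughly {estimate_tokens(size_mb):,} tokens")
print("10M model:", check_dataset(size_mb, "10M"))
print("50M model:", check_dataset(size_mb, "50M"))
```

For a 25MB dataset this reports about 5,000,000 tokens, which is within the range for a 10M model and below the range for a 50M model.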
