ChatGPT Fine-Tuning
When to Fine-Tune?
To get good results, first try prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling.
Only when these approaches stop yielding improvements should you consider fine-tuning.
RAG vs. Fine-Tuning?
Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns.
RAG is not an alternative to fine-tuning and can in fact be complementary to it.
- Large or frequently changing knowledge base => RAG (retrieval supplies the knowledge; fine-tuning shapes the behavior)
Common use cases
- Setting the style, tone, format, or other qualitative aspects
- Producing a desired output more reliably, or performing new skills
- Correcting failures to follow complex prompts
- Handling edge cases in specific ways
- Shortening prompts to use fewer tokens
Estimated Cost
For a training file with 100k tokens trained over 3 epochs, the expected cost would be:
- ~$0.90 USD with gpt-4o-mini-2024-07-18 (after the free period ends on October 31, 2024)
- ~$2.40 USD with gpt-3.5-turbo-0125
Supervised Fine-Tuning (SFT)
- SFT trains the model on a dataset of example completions
- It mainly teaches the LLM to understand the task, follow the instructions, and produce the expected behaviors
- It can hardly teach the LLM new knowledge, since the updated parameters and the dataset are far smaller than in the pre-training phase
Data Structure
- Each example in the dataset should be a conversation in the same format as the Chat Completions API
- a list of messages
- each message has a role, content, and an optional name
- Split the data into a train set and a test set (for evaluation)
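A quick structural check of a JSONL training file, in the spirit of the data-prep cookbook linked in the references. A minimal sketch that assumes plain chat examples (no tool calls); the path train.jsonl is hypothetical:
import json

# Validate that each JSONL line is a chat example in Chat Completions format.
with open("train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        messages = example["messages"]
        assert isinstance(messages, list) and messages, f"line {i}: bad messages"
        for m in messages:
            assert m["role"] in {"system", "user", "assistant"}, f"line {i}: bad role"
            assert isinstance(m.get("content"), str), f"line {i}: missing content"
        # the final message should be the assistant completion to train on
        assert messages[-1]["role"] == "assistant", f"line {i}: no assistant reply"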
Data Size
- Token limits per example are model-specific: 128k input tokens and 65k output tokens for GPT-4.1 (see the token-counting sketch below)
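To check examples against such limits, count tokens with tiktoken (see the cookbook reference). A rough sketch; the per-message overhead of 3 tokens follows the cookbook's heuristic for gpt-4-style chat formatting and may differ for other models:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_messages(messages, tokens_per_message=3, reply_priming=3):
    # Approximate token count of one training example.
    n = reply_priming
    for m in messages:
        n += tokens_per_message
        for value in m.values():
            if isinstance(value, str):
                n += len(encoding.encode(value))
    return n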
Data Amount
- At least 10 examples are required; improvements are typically seen with 50 to 100 training examples
- Consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases".
Data Quality
- Balance and diversity: include good cases and bad cases, each paired with the desired assistant response
- Collect examples that target remaining issues
- Each example should contain all of the information needed to produce the response
- Check the agreement / consistency between human annotators
- Keep training data in the same format as expected at inference time
Example Dataset
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Multi-Turn Examples Dataset
- To skip fine-tuning on specific assistant messages, a "weight" key can be added; 0 means the message is not learned from, 1 means it is
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}
Code
from openai import OpenAI
client = OpenAI()
# Create a supervised fine-tuning (SFT) job
client.fine_tuning.jobs.create(
training_file="file-abc123",
model="gpt-4o-mini-2024-07-18"
)
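The training_file ID comes from uploading the JSONL dataset first. A minimal sketch; the local file name is hypothetical:
from openai import OpenAI
client = OpenAI()

# Upload the dataset; purpose="fine-tune" makes the API validate the format
upload = client.files.create(
    file=open("marv_sarcastic.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)
print(upload.id)  # pass this ID as training_file when creating the job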
Preference Fine-Tuning (Direct Preference Optimization, DPO)
Direct Preference Optimization (DPO) fine-tuning allows you to fine-tune models based on prompts and pairs of responses.
This approach enables the model to learn from human preferences, optimizing for outputs that are more likely to be favored.
Suggested workflow: SFT + DPO
- SFT using a subset of your preferred responses.
- Apply DPO to the SFT fine-tuned model to further align it with the preference comparisons (sketched below).
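A sketch of chaining the two stages; the file IDs are placeholders, and it assumes you wait for the SFT job to succeed before starting the DPO job:
from openai import OpenAI
client = OpenAI()

# Stage 1: SFT on a subset of preferred responses ("file-sft" is a placeholder)
sft_job = client.fine_tuning.jobs.create(
    training_file="file-sft",
    model="gpt-4o-2024-08-06",
)

# ...poll until the job status is "succeeded", then fetch the model name
sft_model = client.fine_tuning.jobs.retrieve(sft_job.id).fine_tuned_model

# Stage 2: DPO on preference pairs, starting from the SFT model
dpo_job = client.fine_tuning.jobs.create(
    training_file="file-preferences",  # placeholder preference dataset ID
    model=sft_model,
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},
)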
Dataset required
Each example in your dataset should contain:
- A prompt, like a user message.
- A preferred output (an ideal assistant response).
- A non-preferred output (a suboptimal assistant response).
- Restriction: currently only single-turn conversations are supported, and the preferred and non-preferred outputs must each be the last assistant message
{
"input": {
"messages": [
{
"role": "user",
"content": "Hello, can you tell me how cold San Francisco is today?"
}
],
"tools": [],
"parallel_tool_calls": true
},
"preferred_output": [
{
"role": "assistant",
"content": "Today in San Francisco, it is not quite cold as expected. Morning clouds will give away to sunshine, with a high near 68°F (20°C) and a low around 57°F (14°C)."
}
],
"non_preferred_output": [
{
"role": "assistant",
"content": "It is not particularly cold in San Francisco today."
}
]
}
Params
- beta: [0, 2]
- A high number will be more conservative (favoring previous behavior), and a lower number will be more aggressive (favor the newly provided preferences more often).
Code
from openai import OpenAI
client = OpenAI()
job = client.fine_tuning.jobs.create(
training_file="file-all-about-the-weather",
model="gpt-4o-2024-08-06",
method={
"type": "dpo",
"dpo": {
"hyperparameters": {"beta": 0.1},
},
},
)
Use Fine-Tuned Model
from openai import OpenAI
client = OpenAI()
# Once a fine-tuning job (SFT or DPO, created as shown above) succeeds, call the resulting model:
completion = client.chat.completions.create(
model="ft:gpt-4o-mini:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
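The fine-tuned model name used above is available from the job once it succeeds; the job ID below is a placeholder:
from openai import OpenAI
client = OpenAI()

job = client.fine_tuning.jobs.retrieve("ftjob-abc123")  # placeholder job ID
print(job.status)            # e.g. "succeeded"
print(job.fine_tuned_model)  # e.g. "ft:gpt-4o-mini:my-org:custom_suffix:id"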
Examples
Style and tone
Gets the model to follow specific style and tone guidance beyond what is possible with prompting alone.
In this example, the goal is to create a chatbot that occasionally gives sarcastic responses:
{"messages": [{"role": "system", "content": "Given a sports headline, provide the following fields in a JSON dict, where applicable: \"player\" (full name), \"team\", \"sport\", and \"gender\"."}, {"role": "user", "content": "Sources: Colts grant RB Taylor OK to seek trade"}, {"role": "assistant", "content": "{\"player\": \"Jonathan Taylor\", \"team\": \"Colts\", \"sport\": \"football\", \"gender\": \"male\" }"}]}
Structured output
Getting the model to provide structured information, in this case about sports headlines:
{"messages": [{"role": "system", "content": "Given a sports headline, provide the following fields in a JSON dict, where applicable: \"player\" (full name), \"team\", \"sport\", and \"gender\"."}, {"role": "user", "content": "Sources: Colts grant RB Taylor OK to seek trade"}, {"role": "assistant", "content": "{\"player\": \"Jonathan Taylor\", \"team\": \"Colts\", \"sport\": \"football\", \"gender\": \"male\" }"}]}
Tool calling
- Get similarly formatted responses even when the full tool definition isn't present
- Get more accurate and consistent outputs
Tip: if your goal is to maximize the correctness of the function-calling output, use the same tool definitions for both training and querying the fine-tuned model.
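A sketch of a single tool-calling training line, assuming the Chat Completions tool format; the get_current_weather function and its schema are illustrative:
{"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}, {"role": "assistant", "tool_calls": [{"id": "call_abc123", "type": "function", "function": {"name": "get_current_weather", "arguments": "{\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}"}}]}], "tools": [{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and country, e.g. San Francisco, USA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"]}}, "required": ["location", "format"]}}}]}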
Code Review
{"messages": [{"role": "system", "content": "You are a highly skilled code review assistant specializing in analyzing pull requests and providing specific, actionable code review comments.\r\nYour task is to analyze the following merge request and generate constructive feedback based on best coding practices.\r\nProvide your answer in English.\r\n\r\nSample merge request code diff from user:\r\n======\r\nUser's additional requirement:(optional)\r\n...\r\n\r\nMerge Request Context:\r\nOld File Path: internal\/bridge\/dm\/billing_item_clickhouse.go\r\nNew File Path: internal\/bridge\/dm\/billing_item_clickhouse.go\r\nNew File: false\r\nRenamed File: false\r\nDeleted File: false\r\nMerge Request Title: <master>[SPOO-12454]Fix Stuck payout\r\n\r\nCode Diff to Review:\r\n---\r\nCode Diffs in internal\/bridge\/dm\/billing_item_clickhouse.go\r\nCode Diff Chunk 1\r\n100,100: unchanged code line\r\n 0,101:+ new code line\r\n101, 0:- removed code line\r\n102,102: unchanged code line\r\n103,103: unchanged code line\r\n 0,104:+ new code line\r\n\r\nCode Diff Chunk 2\r\n......\r\n---\r\n\r\nCurrent File Content after update(optional):\r\n---\r\n(The full content of the current file in the current branch)\r\n---\r\n======\r\n\r\nImportant notes about the structured code diff format above:\r\n1. Each PR code chunk is composed \r\n2. The diff uses line prefixes to show changes(takes \r\n The first number is the old line number, the second number is the new line number\r\n If first(old) line number is 0 and the marker is '+', it means the line is added in the new file\r\n If second(old) line number is 0 and the marker is '-', it means the line is removed in the old file\r\n If both line numbers are the same, it means the line is unchanged\r\n3. The third character is the marker of the line:\r\n '+' \u2192 new line code added\r\n '-' \u2192 line code removed\r\n ' ' \u2192 unchanged line code\r\n\r\n\r\nSpecific guidelines for generating code suggestions:\r\n- You should assume that the code has passed static code analysis, don't suggest to check for duplicate code, wrong import, or unused variables, and so on.\r\n- Only give suggestions that address critical problems and bugs in the PR code. If no relevant suggestions are applicable, return an empty list.\r\n- Should only response the comment with score larger or equal to 6.\r\n- Provide up to 10 distinct and insightful code suggestions. Return less suggestions if no pertinent ones are applicable.\r\n- DO NOT suggest implementing changes that are already present in the '+' lines compared to the '-' lines.\r\n- Focus your suggestions ONLY on new code introduced in the PR ('+' lines).\r\n- Do not suggest to change packages version, add missing import statement, or declare undefined variable.\r\n- When mentioning code elements (variables, names, or files) in your response, surround them with backticks([BACKTICK]). For example: \"verify that [BACKTICK]user_id[BACKTICK] is...\"\r\n- Note that you only see changed code segments, not the entire codebase. 
Avoid suggestions that might duplicate existing functionality or questioning code elements (like variables declarations or import statements) that may be defined elsewhere in the codebase.\r\n- Provide feedback in language: en-US\r\n- Don't Evaluate\r\n\t\t- Possible duplicate code chunk\/function\r\n\t\t- Unit tests\r\n\t\t- Mock or generated files\r\n\t\t- CI\/CD configuration\r\n\t\t- Documentation (including comments and README changes)\r\n\t\t- Import statements\r\n\t\t- Log or print statements\r\n\t\t- The function or method signature\/definition changes\r\n- Pay more attention to:\r\n\t- Potential Bug(such as Nil Pointer Dereference, Index Out of Range, Channel Misuse, type assertion error, Concurrency Issues, JSON & Data Parsing Errors, Improper Use of Time & Context)\r\n\t- Security\r\n\t- Performance(such as DB query missing index, Race Condition, goroutine leak, Memory Leaks & Resource Management, Performance Bottlenecks)\r\n\r\n\r\nOutput Format\r\nYour response should be in JSON format, conforming to the following Golang type definition:\r\n[BACKTICK][BACKTICK][BACKTICK]go\r\ntype CodeReview struct {\r\n\tReviews []Review [BACKTICK]json:\"reviews\"[BACKTICK]\r\n}\r\n\r\ntype Review struct {\r\n\tSeverityScore int [BACKTICK]json:\"severity_score\"[BACKTICK] \/\/ Severity score of the suggestion [1,10], don't return if the score is less than 6\r\n\tOneSentenceSummary string [BACKTICK]json:\"one_sentence_summary\"[BACKTICK] \/\/ Brief summary of the suggestion (max 6 words)\r\n\tSuggestionContent string [BACKTICK]json:\"suggestion_content\"[BACKTICK] \/\/ Concise, actionable suggestion for improvement\r\n\tRelevantFilePath string [BACKTICK]json:\"relevant_file_path\"[BACKTICK] \/\/ the relevant file path\r\n\tStartLineNew string [BACKTICK]json:\"start_line_new\"[BACKTICK] \/\/ Start new line number should be a few(0~5) rows before the problem\r\n\tStartLineOld string [BACKTICK]json:\"start_line_old\"[BACKTICK] \/\/ Start old line number associated with the startlinenew\r\n\tStartLineMarker string [BACKTICK]json:\"start_line_marker\"[BACKTICK] \/\/ The marker could be '+', '-', or ' ' according to above description\r\n\tEndLineNew string [BACKTICK]json:\"end_line_new\"[BACKTICK] \/\/ End new line number should be a few(0~3) rows **after** the problem\r\n\tEndLineOld string [BACKTICK]json:\"end_line_old\"[BACKTICK] \/\/ End old number associated with the EndLineNew\r\n\tEndLineMarker string [BACKTICK]json:\"end_line_marker\"[BACKTICK] \/\/The marker could be '+', '-', or ' ' according to above description\r\n\tLabel string [BACKTICK]json:\"label\"[BACKTICK] \/\/ Suggestion category: \"security\", \"possible bug\", \"performance\", or \"general\"\r\n ExistingCode string [BACKTICK]json:\"existing_code\"[BACKTICK] \/\/ Code snippet from the new PR changes that needs improvement\r\n ImprovedCode string [BACKTICK]json:\"improved_code\"[BACKTICK] \/\/ Improved version of the problematic code snippet\r\n}\r\n\r\n[BACKTICK][BACKTICK][BACKTICK]\r\nIf no issues are found, return: CodeReview{ Reviews: []Review{} }\r\n==="}, {"role": "user", "content": "Merge Request Context:\nOld File Path: internal/module/service_1.go\nNew File Path: internal/module/service_1.go\nNew File: false\nRenamed File: false\nDeleted File: false\nMerge Request Title: [MOD-2001] Update handling logic\n\nCode Diff to Review:\n---\nCode Diffs in internal/module/service_1.go\nCode Diff Chunk 1\n50,50: unchanged code line\n 0,51:+ for _, item := range items { result = append(result, process(item)) }\n51, 0:- previous line 
of code\n52,52: unchanged code line\n---"}, {"role": "assistant", "content": "{\"reviews\": [{\"severity_score\": 7, \"one_sentence_summary\": \"Inefficient loop causing performance bottleneck\", \"suggestion_content\": \"Replace slice appending in loop with pre-allocated slice.\", \"relevant_file_path\": \"internal/module/service_1.go\", \"start_line_new\": \"50\", \"start_line_old\": \"50\", \"start_line_marker\": \" \", \"end_line_new\": \"52\", \"end_line_old\": \"52\", \"end_line_marker\": \" \", \"label\": \"performance\", \"existing_code\": \"for _, item := range items { result = append(result, process(item)) }\", \"improved_code\": \"result := make([]Type, len(items)); for i, item := range items { result[i] = process(item) }\"}]}"}]}
Q&A
- Which models can we fine-tune?
- GPT-4o, GPT-4o-mini, GPT-4.1, and GPT-4.1-mini
- Can the LLM remember things from SFT?
- For example, if we build a dataset from an FAQ user guide, will SFT make the LLM learn its content? Generally not: as noted above, SFT mainly teaches task behavior rather than new knowledge, so RAG is a better fit for that.
References
- OpenAI fine-tuning guide: https://platform.openai.com/docs/guides/fine-tuning
- OpenAI API pricing: https://openai.com/api/pricing/
- Direct Preference Optimization: https://arxiv.org/abs/2305.18290
- How to count tokens with tiktoken: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
- Data preparation and analysis for chat model fine-tuning: https://cookbook.openai.com/examples/chat_finetuning_data_prep
- toy_chat_fine_tuning.jsonl: https://github.com/openai/openai-cookbook/blob/main/examples/data/toy_chat_fine_tuning.jsonl
- Model distillation: https://platform.openai.com/docs/guides/distillation#create-training-dataset-to-fine-tune-smaller-model