I want to train an LLM on a structured dataset, such as a database with multiple tables. Here is a simple example:
t_user:
| id | name |
|---|---|
| 1 | Jason |
| 2 | Eric |
| 3 | David |
t_book:
| id | title |
|---|---|
| 1 | Gone with The Wind |
| 2 | Brave New World |
| 3 | Native Son |
t_bookstore:
| id | name |
|---|---|
| 1 | The Book Nook |
| 2 | The Literary Loft |
| 3 | Wordsmith Books |
t_order:
| user_id | bookstore_id | book_id |
|---|---|---|
| 1 | 2 | 1 |
| 1 | 2 | 2 |
| 2 | 3 | 3 |
After training, the LLM should be able to reason over this relational database. For example, when I ask:
How many books did Jason buy at The Literary Loft bookstore? It should answer: 2 books, Gone with The Wind and Brave New World.
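To pin down the ground truth I expect, here is a minimal sketch that loads the example tables into an in-memory SQLite database and computes the answer with a join. The helper name `books_bought` is made up for illustration. This only shows what the correct answer is; it is not the approach I am after.

```python
import sqlite3

# Build the example database in memory (same rows as the tables above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_user (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE t_book (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE t_bookstore (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE t_order (user_id INTEGER, bookstore_id INTEGER, book_id INTEGER);

INSERT INTO t_user VALUES (1, 'Jason'), (2, 'Eric'), (3, 'David');
INSERT INTO t_book VALUES
    (1, 'Gone with The Wind'), (2, 'Brave New World'), (3, 'Native Son');
INSERT INTO t_bookstore VALUES
    (1, 'The Book Nook'), (2, 'The Literary Loft'), (3, 'Wordsmith Books');
INSERT INTO t_order VALUES (1, 2, 1), (1, 2, 2), (2, 3, 3);
""")


def books_bought(conn, user_name, store_name):
    """Return the titles a user bought at a given bookstore (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT b.title
        FROM t_order AS o
        JOIN t_user AS u ON u.id = o.user_id
        JOIN t_bookstore AS s ON s.id = o.bookstore_id
        JOIN t_book AS b ON b.id = o.book_id
        WHERE u.name = ? AND s.name = ?
        ORDER BY b.id
        """,
        (user_name, store_name),
    ).fetchall()
    return [title for (title,) in rows]


titles = books_bought(conn, "Jason", "The Literary Loft")
print(len(titles), titles)  # 2 ['Gone with The Wind', 'Brave New World']
```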
How can I prepare the corpus from the database and train the LLM?
I have looked into some solutions, such as asking the LLM to translate prompts into SQL queries, but that's not what I want.
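To illustrate the kind of corpus I have in mind, here is a naive sketch that resolves the foreign keys in each `t_order` row and flattens it into one natural-language sentence. The sentence template and the function name `order_to_sentence` are my own assumptions, not an established recipe, and I do not know whether training on such sentences would actually let the model count or combine facts correctly.

```python
# Example tables as plain Python structures (same rows as above).
users = {1: "Jason", 2: "Eric", 3: "David"}
books = {1: "Gone with The Wind", 2: "Brave New World", 3: "Native Son"}
bookstores = {1: "The Book Nook", 2: "The Literary Loft", 3: "Wordsmith Books"}
orders = [(1, 2, 1), (1, 2, 2), (2, 3, 3)]  # (user_id, bookstore_id, book_id)


def order_to_sentence(user_id, bookstore_id, book_id):
    """Turn one order row into a sentence by looking up each foreign key."""
    return (
        f'{users[user_id]} bought "{books[book_id]}" '
        f"at {bookstores[bookstore_id]}."
    )


# One sentence per order row; this list would be the raw training text.
corpus = [order_to_sentence(*order) for order in orders]

for sentence in corpus:
    print(sentence)
```

Each row becomes a standalone fact such as `Jason bought "Gone with The Wind" at The Literary Loft.`, so answering the question above would require the model to combine two of these sentences.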