Large Language Model (LLM): A neural network trained on vast quantities of text to predict and generate language. LLMs learn grammar, facts, reasoning patterns, and conversational style from the training data.
Token: The basic unit of text that an LLM processes, roughly 3–4 characters or about ¾ of a word in English. Pricing for LLM APIs is usually quoted per 1 million tokens.
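
As a worked example of per-million-token pricing, the sketch below estimates the cost of a single request. The function name `estimate_cost` and the prices are hypothetical placeholders, not any provider's actual rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Return the request cost in the same currency as the prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
cost = estimate_cost(2_000, 500, 3.00, 15.00)
print(f"{cost:.4f}")
```

Input and output tokens are priced separately here because many APIs quote them at different rates; if a provider uses a single rate, pass the same price twice.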
Transformer: The neural network architecture behind almost all modern LLMs. It uses self-attention to weigh the relevance of every token to every other token in the sequence.
RAG (Retrieval-Augmented Generation): A technique that retrieves relevant documents from a knowledge base and includes them in the prompt before generation, grounding the model in current, specific information.
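
The retrieve-then-prompt flow can be sketched in a few lines. This is a toy illustration only: relevance is scored by plain word overlap rather than embeddings, and `score`, `build_prompt`, and the sample knowledge base are all hypothetical.

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: count of distinct query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Retrieve the k best-scoring documents and place them ahead of the question."""
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base.
knowledge_base = [
    "The office is closed on public holidays.",
    "Refunds are processed within 14 days of a return.",
    "Support is available by email on weekdays.",
]
prompt = build_prompt("how long are refunds processed", knowledge_base, k=1)
print(prompt)
```

The resulting string is what would be sent to the model; a production system would typically swap the word-overlap scorer for embedding similarity.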
Hallucination: When an AI model generates confident-sounding text that is factually incorrect or entirely fabricated. A key limitation of current LLMs.
Fine-tuning: Continuing to train a pre-trained model on a smaller, task-specific dataset to adapt its behaviour for a particular domain or task.
Embedding: A dense numerical vector representation of text that captures its semantic meaning. Similar meanings produce similar vectors, enabling search and retrieval.
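
"Similar vectors" is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```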
Context Window: The maximum number of tokens an LLM can process at once, covering both the input prompt and the output. Larger context windows allow longer documents and conversations.
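
Because the prompt and the output share one window, a longer prompt leaves less room for the response. A minimal sketch of that budget, with a hypothetical helper name and a hypothetical 8,192-token window:

```python
def max_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the response once the prompt is counted (never negative)."""
    return max(0, context_window - prompt_tokens)


# Hypothetical model with an 8,192-token context window.
remaining = max_output_budget(prompt_tokens=6_000, context_window=8_192)
print(remaining)
```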
Temperature: A parameter controlling the randomness of model outputs. Low temperature makes outputs more deterministic; high temperature makes them more varied and creative, but also more error-prone.
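
One common way temperature is applied is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with softmax. The sketch below assumes that formulation; the logits are made up, and the temperature must be greater than zero.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after scaling by 1 / temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
print(cold[0] > hot[0])
```

At low temperature the top-scoring token takes most of the probability mass; at high temperature the mass spreads across the alternatives.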
Attention Mechanism: The core operation of a transformer. For each token, attention computes a weighted sum over the representations of the tokens in the sequence, allowing the model to focus on the most relevant parts of the input.
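
The weighted sum can be sketched as single-head scaled dot-product attention. This simplified version omits the learned query, key, and value projections, masking, and multiple heads that a real transformer layer has, so the same toy vectors serve as queries, keys, and values.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return the attention-weighted sum of the value vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output component at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy token vectors
out = attention(tokens, tokens, tokens)
print(out[0][0] > out[0][1])
```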
RLHF: Reinforcement Learning from Human Feedback, a training technique that uses human preferences to improve model outputs via a reward model trained on human comparisons.
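
A reward model is often trained on such comparisons with a pairwise loss that is small when the human-preferred response scores higher than the rejected one. The sketch below assumes that common formulation, the negative log-sigmoid of the score difference; the function name and scores are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(chosen - rejected)), written as log(1 + exp(-difference))."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))


agrees = preference_loss(2.0, 0.0)     # reward model ranks the pair as the human did
disagrees = preference_loss(0.0, 2.0)  # reward model ranks the pair the other way
print(agrees < disagrees)
```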
Vector Database: A database optimised for storing and searching embeddings. Given a query embedding, it efficiently returns the most semantically similar stored vectors. Foundational to RAG systems.
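
The core query, "return the stored vectors most similar to this one", can be sketched with an exhaustive search. The class `TinyVectorStore` is hypothetical and compares the query against every stored vector; real vector databases generally rely on approximate nearest-neighbour indexes to avoid that full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """In-memory store with brute-force similarity search (illustration only)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        self._items.append((item_id, vector))

    def search(self, query: list[float], k: int = 1) -> list[str]:
        """Return the ids of the k stored vectors most similar to the query."""
        ranked = sorted(self._items, key=lambda item: cosine(query, item[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]


store = TinyVectorStore()
store.add("cats", [0.9, 0.1])   # made-up embeddings
store.add("taxes", [0.1, 0.9])
nearest = store.search([0.8, 0.2], k=1)
print(nearest)
```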
Inference: Running a trained model to generate outputs. When you send a prompt to an API and receive a response, inference is happening. Cost scales with token count and model size.
Quantisation: Reducing the numerical precision of model weights to make models smaller and faster, trading a small amount of accuracy for major gains in memory and inference speed.
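
The precision trade-off can be seen in a minimal sketch of symmetric 8-bit quantisation, one of several schemes in use. Each weight is mapped to an integer in [-127, 127] with a single shared scale; the function names and weights are hypothetical.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantised]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up weights
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantised, max_error)
```

The integers need one byte each instead of four for 32-bit floats, and the rounding error per weight is at most half the scale.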
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that adds small trainable adapter matrices instead of updating all model weights, making fine-tuning much cheaper.
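
The adapter idea can be sketched as a frozen weight matrix `W` plus the product of two small matrices `B` and `A`, where only `A` and `B` would be trained. The helper names and toy matrices are hypothetical, and `scale` stands in for the scaling factor applied to the low-rank update.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale: float = 1.0) -> list[float]:
    """Compute W x + scale * B (A x); W stays frozen, A and B are the adapter."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # rank-1 adapter, shape 1x2
B = [[1.0], [0.0]]            # rank-1 adapter, shape 2x1
y = lora_forward(W, A, B, [1.0, 2.0])
print(y, adapter_params(4096, 4096, 8), 4096 * 4096)
```

For a 4096 by 4096 layer, a rank-8 adapter trains 65,536 parameters instead of 16,777,216, which is where the cost saving comes from.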
Knowledge Cutoff: The date after which a model has no training data. Events after this date are unknown to the model unless provided via RAG or search tools.