Introduction to Pydantic

Pydantic is a Python library that handles data validation and settings management using type-annotated class fields. In this post, we will cover the basics of Pydantic and see how to use it to model and validate JSON data coming from an external source. We'll see how to create constrained fields, write custom field validators, and export models to JSON strings and dictionaries.

When dealing with data from files, external APIs, or users, we often need to validate it and convert it from one type to another - for example, converting a string representation of a number to an integer or a float. We may also need to account for optional fields, fields with dynamic default values, and fields with very specific validation rules.
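As a taste of what's to come, here's a minimal sketch of that kind of conversion (the Measurement model and its fields are invented for illustration). Pydantic coerces compatible string inputs to the annotated types automatically:

```python
from pydantic import BaseModel

class Measurement(BaseModel):
    count: int      # "42" will be coerced to 42
    ratio: float    # "0.5" will be coerced to 0.5

m = Measurement(count="42", ratio="0.5")
print(m.count, m.ratio)  # 42 0.5
```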

We will see examples of how to deal with these issues using pydantic throughout this post, and future posts. The series is outlined below:

  1. Introduction to Pydantic (this post)
  2. Nested models and JSON Schemas
  3. Field() function, Exporting Models, and Model Config class
  4. Pydantic Validators 

Objectives

In this post, we will learn:

  • How to create Pydantic model classes to define the structure of our data.
  • How to use Python types to determine the expected data-types of our class fields.
  • How to specify Optional fields that may or may not be present in data sources.
  • How to specify constrained fields that only permit a subset of values (for example, a numerical field that only allows values between 0 and 1000).
  • How to write custom validator functions with Pydantic.
  • How to export models to Python dictionaries and JSON strings.

Pydantic Basics

To illustrate how we can use pydantic, we'll start with a canned dataset of student information hosted on GitHub (the URL is included in the code below).

This is JSON data from an external source, and we may want to validate it in our applications. A single record has the following shape:

    {
        "id": "d15782d9-3d8f-4624-a88b-c8e836569df8",
        "name": "Eric Travis",
        "date_of_birth": "1995-05-25",
        "GPA": "3.0",
        "course": "Computer Science",
        "department": "Science and Engineering",
        "fees_paid": false
    }

There are a few things we need to account for with this data:

  1. The date_of_birth is represented as a string in the dataset - we want to convert this to a date object.
  2. The GPA is also represented as a string - we want to convert this to a floating-point number between 0 and 4.
  3. The course field is potentially null for some records, as some students may not yet be assigned to a course. We need to make this field optional.
  4. The department field should be constrained to a small set of permissible values, since the university/college only has a small number of departments.

So let's get started!

With pydantic, we can build a class to determine the shape of the data we are expecting, as well as the data-types for each field. These classes are called models, and they inherit from a BaseModel class in pydantic.

Within the model class, we create fields for the data that we expect from our data source.

We're going to need two Python libraries for this tutorial - firstly, pydantic itself, and also the requests module. You can install these with the following command:

pip install pydantic requests

Let's start by writing code to get the sample JSON data from Github into our application. Create a Python file with any name you like, and add the following code.

import requests 

url = 'https://raw.githubusercontent.com/bugbytes-io/datasets/master/students_v1.json'

response = requests.get(url)
data = response.json()
print(data)

This should print the dictionary of data to the terminal.

We now want to pass the data that we've retrieved to a Pydantic model that will validate the shape of the data and return the correct fields and data-types. So let's create the model!

Note the field types for our data:

  • id - this should be a UUID object (a universally unique identifier). Python has a uuid module in the standard library which we can use to validate this field
  • name, course, and department - these are strings.
  • GPA - this should be a floating-point number
  • date_of_birth - this should be a date object
  • fees_paid - this is a boolean field

We'll now define a model that inherits from Pydantic's BaseModel class, and has the above fields/types. This will be a simple, initial implementation - we will extend it later to account for optional fields and add additional validation and constraints.

import uuid
from datetime import date
import requests
from pydantic import BaseModel

url = 'https://raw.githubusercontent.com/bugbytes-io/datasets/master/students_v1.json'
data = requests.get(url).json()

# define Pydantic model class
class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: float
    course: str
    department: str
    fees_paid: bool

for student in data:
    # create Pydantic model object by unpacking key/val pairs from our JSON dict as arguments 
    model = Student(**student)
    print(model)

If you execute this script, you will see that the model throws a ValidationError with the following message:

pydantic.error_wrappers.ValidationError: 1 validation error for Student
course
  none is not an allowed value (type=type_error.none.not_allowed)

The validation error tells us that, on the Student model, the course field is receiving a type that is not allowed - the None type.

If we look at the course field's definition, it specifies that we should receive a string here. However, as noted above:

The course field is potentially null for some records, as some students may not yet be assigned to a course. We need to make this field optional.

We need to make this field nullable on our model - Python's typing module has an Optional type that can implement this requirement. For example, an Optional[int] field would expect an integer if a value is passed, but values can be omitted too, in which case the field is set to None.

Note that Optional[int] is shorthand for Union[int, None], where a Union of different types means that the value can take any one of the specified types.

From Python 3.10 onward, we can write Optional[int] with a shorthand syntax: int | None.
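To see this behaviour in isolation, here's a small sketch (the Enrollment model is hypothetical) showing that an Optional field accepts both a value of the annotated type and None:

```python
from typing import Optional
from pydantic import BaseModel

class Enrollment(BaseModel):
    # accepts an int (or something coercible to one), or None
    seat_number: Optional[int]

print(Enrollment(seat_number="7").seat_number)   # 7
print(Enrollment(seat_number=None).seat_number)  # None
```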

Let's add this to our model for the course field:

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: float
    course: str | None
    department: str
    fees_paid: bool

Note that the course field has had its type changed to str | None, indicating that this is an optional field.

Try running the script again. You should now find that there are no ValidationErrors, and the script converts each row of JSON to a Pydantic model with the correct types for each field.

The first object output is shown below:

Student(
    id=UUID('d15782d9-3d8f-4624-a88b-c8e836569df8'),
    name='Eric Travis',
    date_of_birth=datetime.date(1995, 5, 25),
    GPA=3.0,
    course='Computer Science',
    department='Science and Engineering',
    fees_paid=False
)

Notice the conversion of the types - the raw strings for the id and date_of_birth fields have been converted to UUID and date objects, respectively. The GPA field has been converted from a string to a float, too.

And in the case of missing/null course values, the field is filled in with the None type, thanks to our new Optional definition.

This has all been achieved just by defining this BaseModel and giving the correct types to each field. Very easy and Pythonic!

Constrained Types

Let's introduce another concept from Pydantic - that of constrained types.

A constraint on a field is essentially a limit on the potential values it can take on. To demonstrate, let's say we have a user record as follows:

class User(BaseModel):
    name: str
    age: int

OK, this makes sense, but we may want to constrain the values of the age field. For example, age cannot be a negative number, and is unlikely to exceed a specific upper bound (let's say 130).

If a user sends an erroneous value to our server - an age of (say) 2000 - the current implementation will accept it without question.

Enter constrained types!

For each Python primitive, pydantic has a constrained variant. For example, for the int type, there's a conint type available in pydantic. We also have confloat, constr, conlist, etc.

Each of these constrained types takes arguments that define the constraints. We can do things such as enforce upper- or lower-casing (strings), enforce lower/upper bounds on the number of items in a list/set/frozenset, and enforce lower/upper bounds on the values of numbers (ints, floats, Decimals).

Let's use this with our age field above:

from pydantic import BaseModel, conint

class User(BaseModel):
    name: str
    age: conint(gt=0, lt=130)

# dummy data
u1 = {'name': 'xyz', 'age': 50}
u2 = {'name': 'xyz', 'age': 150}

# attempt to convert to Pydantic models
user1 = User(**u1)  # this is fine
user2 = User(**u2)  # ValidationError - age in 'u2' exceeds upper bound

The age field now uses a constrained integer type that only accepts values greater than 0 and less than 130.

Note: the same effect can be achieved using Pydantic's Field function, too.

from pydantic import BaseModel, Field

class User(BaseModel):
    name: str
    age: int = Field(..., gt=0, lt=130)

The Field() function is useful for specifying additional information and validation/constraints on fields.

Let's go back to our original Student model, and add a constraint on the GPA field. This should only take values between 0 and 4 (inclusive).

from pydantic import BaseModel, confloat

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: str
    fees_paid: bool

We use the confloat type to determine this constraint. Now, if we receive data with a GPA outside of this range, the model will reject the data and throw a ValidationError.

Custom Validator Functions

Sometimes, our validation logic cannot be expressed as simply as with the built-in constrained types. We also may need to use dynamic values when validating, such as fetching the current datetime. For these purposes, we can use custom validator functions instead.

Let's say we want to ensure that students cannot enrol if they are under 16 years old. We need to implement a validator on our date_of_birth field.

From Pydantic, we will import the validator() decorator, and write a method to validate the date_of_birth field.

import uuid
from datetime import date, datetime, timedelta
from pydantic import BaseModel, confloat, validator

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: str
    fees_paid: bool

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

The validator function is defined at the bottom of the model; the first argument of the decorator specifies the field(s) to which the function should be applied - date_of_birth in our case.

The source dataset does not have any dates of birth that would fail this validation, but you can add one after fetching the data, as below:

url = 'https://raw.githubusercontent.com/bugbytes-io/datasets/master/students_v1.json'
data = requests.get(url).json()
data.append(
    {
        "id": "48dda775-785d-41e3-b0dd-26a4a2f7722f",
        "name": "Justin Holden",
        "date_of_birth": "2010-08-22",
        "GPA": "3.23",
        "course": "Philosophy",
        "department": "Arts and Humanities",
        "fees_paid": 'true'
    }
)

If you run the full script with this data, the new record will cause a ValidationError (wrapping the ValueError we raised), because the date of birth is more recent than 16 years ago!

With such custom validators, Pydantic allows very flexible and dynamic validation of your data fields. These validator methods are where complex field validation logic should occur.
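As a further sketch of the decorator's flexibility (the Person model here is hypothetical), a single validator can be applied to several fields by passing multiple field names to the decorator:

```python
from pydantic import BaseModel, validator

class Person(BaseModel):
    first_name: str
    last_name: str

    # one validator applied to both name fields
    @validator('first_name', 'last_name')
    def normalise_name(cls, value):
        # strip surrounding whitespace and title-case the name
        return value.strip().title()

p = Person(first_name='  ada ', last_name='lovelace')
print(p.first_name, p.last_name)  # Ada Lovelace
```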

Model Enum fields

Let's finish this example by defining an Enum in Python for our department field.

We're going to assume the college/university has a small set of three departments - Science and Engineering, Arts and Humanities, and Life Sciences.

Let's define a Python Enum:

from enum import Enum

class DepartmentEnum(Enum):
    ARTS_AND_HUMANITIES = 'Arts and Humanities'
    LIFE_SCIENCES = 'Life Sciences'
    SCIENCE_AND_ENGINEERING = 'Science and Engineering'
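As an aside, the value-lookup that Pydantic performs here is plain Enum behaviour - the class can look up a member by its value, and raises a ValueError for unknown values (the enum is repeated below so the snippet is self-contained):

```python
from enum import Enum

class DepartmentEnum(Enum):
    ARTS_AND_HUMANITIES = 'Arts and Humanities'
    LIFE_SCIENCES = 'Life Sciences'
    SCIENCE_AND_ENGINEERING = 'Science and Engineering'

# look up a member by its value - this is what Pydantic does
# with the raw department strings in our dataset
member = DepartmentEnum('Life Sciences')
print(member)  # DepartmentEnum.LIFE_SCIENCES
```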

We can now use the DepartmentEnum as our type for the department field in the model:

class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

The department field's type has been changed to our new enum.

Run this entire script, and it should work when converting each object from the JSON file on GitHub. You should now see that all the type conversions have occurred, and we have a UUID, a date, and a DepartmentEnum among our model values.

Furthermore, all validations have run on our models: the basic type conversions, the GPA range check via the confloat type, and the date_of_birth check via the custom validator function.

The full code is shown below - feel free to split the enum and Pydantic models into their own files to clean things up!

import uuid
import requests
from datetime import date, datetime, timedelta
from pydantic import BaseModel, confloat, validator
from enum import Enum 


url = 'https://raw.githubusercontent.com/bugbytes-io/datasets/master/students_v1.json'
data = requests.get(url).json()


class DepartmentEnum(Enum):
    ARTS_AND_HUMANITIES = 'Arts and Humanities'
    LIFE_SCIENCES = 'Life Sciences'
    SCIENCE_AND_ENGINEERING = 'Science and Engineering'


class Student(BaseModel):
    id: uuid.UUID
    name: str
    date_of_birth: date
    GPA: confloat(ge=0, le=4)
    course: str | None
    department: DepartmentEnum
    fees_paid: bool

    @validator('date_of_birth')
    def ensure_16_or_over(cls, value):
        sixteen_years_ago = datetime.now() - timedelta(days=365*16)

        # convert datetime object -> date
        sixteen_years_ago = sixteen_years_ago.date()
        
        # raise error if DOB is more recent than 16 years past.
        if value > sixteen_years_ago:
            raise ValueError("Too young to enrol, sorry!")
        return value

for student in data:
    # create Pydantic model object by unpacking key/val pairs from our JSON dict as arguments 
    model = Student(**student)
    print(model)

Let's finish this tutorial by seeing how to convert our Pydantic models to a dictionary, or to JSON data.

Exporting Model to JSON or Dictionary

Pydantic has a couple of helper methods to allow you to export your validated models to a dictionary, or to JSON.

Firstly, let's see how to convert the model to a dictionary. The model has a simple .dict() method that can do this.

For example, in the for-loop at the bottom of the above code, you can add the following:

for student in data:
    # create Pydantic model object by unpacking key/val pairs from our JSON dict as arguments 
    model = Student(**student)
    print(model.dict())

So, for each model we create, we then convert the model to a dictionary. This preserves all the converted types and validations from our model class, and simply dumps the output to a dictionary, which may be more convenient to work with in many cases.

As well as .dict(), models have a .json() method that dumps the data to a JSON string. This converts the complex types back to primitives - for example, the UUID object is converted back to a string, since JSON supports only a small set of types (strings, numbers, booleans, null, arrays, and objects).

The date of birth will also undergo conversion from a date object back to a string.

for student in data:
    # create Pydantic model object by unpacking key/val pairs from our JSON dict as arguments 
    model = Student(**student)
    print(model.json())

Sample output is shown below:

{
   "id":"d15782d9-3d8f-4624-a88b-c8e836569df8",
   "name":"Eric Travis",
   "date_of_birth":"1995-05-25",
   "GPA":3.0,
   "course":"Computer Science",
   "department":"Science and Engineering",
   "fees_paid":false
}

After this serialization, the data could be output into a file, or sent across an API to another application.
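A compact way to see the difference between the two export methods, using a hypothetical one-field model:

```python
import uuid
from pydantic import BaseModel

class Record(BaseModel):
    id: uuid.UUID

record = Record(id='d15782d9-3d8f-4624-a88b-c8e836569df8')

# .dict() keeps the rich Python types...
print(type(record.dict()['id']))  # <class 'uuid.UUID'>

# ...while .json() serialises them back to JSON primitives
print(record.json())
```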

Summary

This post has introduced some key concepts in Pydantic. We've learned how to create model classes, annotate fields with types, specify constraints on the values for fields, and how to denote a field as Optional.

We've also learned how to define custom validation functions for fields, using the @validator decorator.

Finally, we learned how to dump our models to dictionaries and JSON strings after we've used the models to process incoming (or potentially outgoing) data.

In the next post, we'll learn how to work with nested objects, how to create JSON schemas from our model definitions, and how to do more advanced things with Pydantic. 

If you enjoyed this post, please subscribe to our YouTube channel and follow us on Twitter to keep up with our new content!

Please also consider buying us a coffee, to encourage us to create more posts and videos!
