Tutorial

Normalized Database for Querying I90 Files from REE

Build a normalized database to efficiently query and analyze historical energy programming data from the Spanish electrical system. Learn database design, normalization, and optimization for energy market analysis.

Imagine you are an energy trader and want to analyze the energy management of the most significant units in the Spanish electricity market: those that have suffered the most curtailment. Ideally, you would have the historical information of the energy programming units in a database (DB) so you could query and analyze it efficiently.

For example, you would like to know which five units suffered the most curtailment after the day-ahead market in each month of 2023. First, you query the operational information from the database and then analyze it.

from sqlalchemy import create_engine
import pandas as pd

# Connect to the DB and fetch the operational data with a SQL query
engine = create_engine(f'postgresql://{USER}:{PASSWORD}@{URL}:{PORT}/{DATABASENAME}')
df = pd.read_sql(query, engine)  # `query` selects the 2023 operational data

# Curtailments are negative energy adjustments, so the most curtailed
# units are those with the smallest energy sums
idx = df.groupby('unit').energy.sum().nsmallest(5).index
(df
 .query('unit in @idx')
 .pivot_table(
   index='unit',
   columns='month',
   values='energy',
   aggfunc='sum'
   )
)
Graph of units with the most curtailment in the Spanish electricity market

Why do you need a database to analyze the information from the I90 files? We explain it below.

I90 files context

Introduction

As explained in this tutorial, the I90 files from REE provide daily information on the energy programming units in Spain, broken down by generation program.

Limitations of public servers

To work with all the historical information, we download all the I90 files from REE through their website, from the first available date to the current date. Once the search returns its results, we click on the I90 file to download the zip.

I90 file search interface on the REE website

The first problem we encounter when requesting such a wide range of dates is that the servers cannot handle the amount of data to be sent and fail, returning an error.

{
   "Status": 500,
   "message": "Internal server error"
}

For each request, REE returns a zip containing as many Excel files as there are days in the requested period.

Uncompressed Excel files from the I90 file
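
One way around the server error is to request the files in small batches, for example day by day, and unzip each response. Below is a minimal sketch with the requests library; the download URL is a placeholder, and the real one can be copied from the browser after resolving the search on the REE website.

import io
import zipfile
from datetime import date, timedelta

import requests

# Placeholder endpoint: copy the real URL from the browser after
# resolving the search on the REE website
BASE_URL = 'https://example.com/i90/download'

day, end = date(2023, 1, 1), date(2023, 12, 31)
while day <= end:
    # Request a single day so the payload stays small enough for the server
    response = requests.get(BASE_URL, params={'date': day.isoformat()}, timeout=60)
    response.raise_for_status()
    # The zip contains one Excel per day of the requested range (here, one)
    with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
        zf.extractall('i90')
    day += timedelta(days=1)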

If we also have to filter the information by market agent across all the Excels, we would not only have to download the data but also filter each Excel file one by one.

On the other hand, if we develop a database that stores the information in a structured way, in a single table where the data is categorized by market agent, programming unit, generation program, and result type, we can query the information much more efficiently.

For example, if we request all the operational information of the Iberdrola units with code IBEG, it would take less than two seconds.

Quick query of operational information of Iberdrola units

Without the database, we spend more time preprocessing the Excel files for each new analysis than we would spend developing the database once and loading the latest information into it every day.

Process to develop DB system

Normalize data structure

There are many Excels, each with similar but not identical columns. Additionally, the values are sometimes in MWh and other times in €/MWh. The I90 data from REE come from generation units, with each sheet representing a different program and result type.

The Excel information is not normalized. That is, the records are not in a single table differentiated by columns that classify the observations. Instead, the records are in different sheets according to the program and result type. For example, the I90DIA03 sheet only contains the result of the resolution of restrictions in the day-ahead market, in MWh.

Excel sheet I90DIA03 with energy results in MWh

If we wanted to look at the price, in €/MWh, that the units would charge for these restrictions, we would have to go to the I90DIA09 sheet.

Excel sheet I90DIA09 with prices in €/MWh

Using the domain knowledge provided by the Official State Gazette on the operation of the electrical system in Spain, we can transform all the Excel sheets into a single table that contains the information classified as follows:

Normalized data format in a single table

Each record represents a programming unit in a generation program, which has been adjusted in a certain way by the market and system operators, OMIE and REE, respectively, to generate, or not, a certain amount of energy at a specific price.
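
As an illustration, here is a minimal sketch of how two sheets could be reshaped into that single long table with pandas. The file name and the sheet layout (a unit column followed by one column per hour) are assumptions that must be checked against the real files.

import pandas as pd

FILE = 'I90DIA_20230101.xlsx'  # hypothetical file name

def load_sheet(sheet, **labels):
    # Reshape one sheet to one record per unit and hour, then tag the
    # records with the columns that classify them (program, magnitude, ...)
    df = pd.read_excel(FILE, sheet_name=sheet)
    long = df.melt(id_vars='unit', var_name='hour', value_name='value')
    return long.assign(**labels)

# Energy (MWh) and price (EUR/MWh) of the same restrictions, merged
# from two sheets into a single normalized table
normalized = pd.concat([
    load_sheet('I90DIA03', program='restrictions', magnitude='MWh'),
    load_sheet('I90DIA09', program='restrictions', magnitude='EUR/MWh'),
])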

Design DB schema

Once the relationships between the tables are structured, we can define the DB schema following the relational model and applying techniques such as normalization to avoid information repetition. For example, programming units have an associated technology, which can be hydraulic, wind, solar, etc.

To avoid duplicating the technology type information in the operational data table, we create a specific table for the programming units. This table will contain the information of each programming unit and its associated technology, separate from the operational data table.

Relational database schema for I90 data
  • unit: information of the programming units, such as the technology they use to generate energy.
  • configuration: configuration of the generation programs, such as the type of restriction applied.
  • program: information of the generation programs.
  • operational: operational information of the programming units (energy and price according to program, unit, and operational conditions).

Rent DB

Once the DB schema is designed, we can rent a cloud DB and connect to it from any client (Python, Excel, Power BI, etc.).

In our case, we have rented a PostgreSQL DB for $15 a month on DigitalOcean. Initially, it is best not to over-provision resources (memory, CPU, etc.) to avoid paying for capacity we do not need. If we need more resources later, we can scale the DB with a click.

Once the DB is created, they provide us with the credentials that we will use later to connect to it.

Database rental interface on Digital Ocean

Create tables in DB

To create the tables in the DB, we must first define the schema of the tables in a file. In our case, we have defined the schema in a models.py file that contains the structure of the tables, with columns and data types.
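
The actual models.py is not reproduced here, but a minimal sketch of what it could look like with SQLAlchemy declarative models follows; the column names beyond the schema diagram are assumptions.

# models.py -- a minimal sketch of the four tables with SQLAlchemy
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Unit(Base):
    __tablename__ = 'unit'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    technology = Column(String)  # hydraulic, wind, solar, ...

class Program(Base):
    __tablename__ = 'program'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

class Configuration(Base):
    __tablename__ = 'configuration'
    id = Column(Integer, primary_key=True)
    program_id = Column(Integer, ForeignKey('program.id'))
    restriction_type = Column(String)  # assumed column name

class Operational(Base):
    __tablename__ = 'operational'
    id = Column(Integer, primary_key=True)
    unit_id = Column(Integer, ForeignKey('unit.id'))
    configuration_id = Column(Integer, ForeignKey('configuration.id'))
    energy = Column(Float)  # MWh
    price = Column(Float)   # EUR/MWh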

Now we create the connection to the DB with the credentials provided by Digital Ocean.

from sqlalchemy import create_engine
engine = create_engine(f'postgresql://{USER}:{PASSWORD}@{URL}:{PORT}/{DATABASENAME}')

Finally, we create the tables from Python.

from models import Base
Base.metadata.create_all(engine)

All good so far, but the tables are empty. How do we insert the data? How do we query it?

Interact with DB

Insert records in DB

Since the information is structured in different tables to avoid repetition, we must segment the normalized Excel into different files and insert each one into the corresponding table of the DB.

Therefore, we segment the normalized Excel into different tables and export them to new Excel files.

segment_table(entity='unit').to_excel('unit.xlsx')
segment_table(entity='program').to_excel('program.xlsx')
...
segment_table(entity='operational').to_excel('operational.xlsx')
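
segment_table is our own helper, not shown above. Here is a minimal sketch of what it could do, assuming the normalized data lives in a DataFrame called normalized and that the listed column names exist.

# Hypothetical helper: extract the deduplicated columns of one entity
# from the normalized DataFrame built earlier
def segment_table(entity):
    columns = {
        'unit': ['unit'],
        'program': ['program'],
        'operational': ['unit', 'program', 'hour', 'magnitude', 'value'],
    }[entity]
    return normalized[columns].drop_duplicates()

In the real operational table, the unit and program names would be replaced by the ids of the related tables before inserting.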

Again, with Python, we can insert the data into the DB directly from the Excel files, using the pandas library.

import pandas as pd

# Insert each segmented Excel into its table; the tables already exist,
# so we append, and index=False keeps the DataFrame index from being
# written as an extra column
df = pd.read_excel('unit.xlsx')
df.to_sql('unit', engine, if_exists='append', index=False)

df = pd.read_excel('program.xlsx')
df.to_sql('program', engine, if_exists='append', index=False)

...

# The operational table is the largest, so we insert it in chunks
df = pd.read_excel('operational.xlsx')
df.to_sql('operational', engine, if_exists='append', index=False, chunksize=10_000)

Exporting the data to Excel was not strictly necessary, since we already had it in Python. However, separating the processes is good practice: it prevents data from getting mixed up if we reuse the same code in the future.

Request data from DB

We can now request data from the DB. For example, let’s see the data of some programming units of Iberdrola that contain IBEG in their name.

SELECT *
FROM unit
WHERE name LIKE '%IBEG%'
SQL query result for Iberdrola units

At the moment, the database is very limited: it holds just enough information to relate the operational data of the programming units to the generation programs. If we want to add information such as the type of technology, we must extend the DB schema.

Extend DB schema

The information about the programming units, such as the technology or the market agent that manages them, is available on the REE website.

Information of programming units on the REE website

To insert this information into the database, we first extend the schema of the unit table with the new columns in models.py and update the DB.

Base.metadata.create_all(engine)

Note that create_all only creates tables that do not exist yet; it does not add columns to an existing table, so extending a live table requires an ALTER TABLE statement or a migration tool such as Alembic, as sketched below.
Database schema extension with new columns
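
A minimal sketch of adding the columns by hand; the column names technology and market_agent_id are assumptions taken from the extended schema.

from sqlalchemy import text

# Assumed new columns; adapt the names and types to the real schema
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE unit ADD COLUMN technology VARCHAR'))
    conn.execute(text('ALTER TABLE unit ADD COLUMN market_agent_id INTEGER'))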

We download the data from the REE website, preprocess it according to the DB schema, and insert it into the database.

df = pd.read_excel('units_ree_I90.xlsx')
# 'replace' drops the existing table and recreates it with the new data
df.to_sql('unit', engine, if_exists='replace', index=False)

Now we not only have more information about the programming units, but we can also request them according to the market agent that manages them. For example, let's see which programming units are managed by the market agent IBEG, belonging to Iberdrola.

SELECT *
FROM unit
INNER JOIN market_agent
   ON unit.market_agent_id = market_agent.id
WHERE market_agent.name = 'IBEG'
SQL query for units managed by Iberdrola

We can also use the market agent to request the operational information of the programming units.

SELECT
    unit.name AS unit_name,
    program.name AS program_name,
    operational.*,
    configuration.*
FROM operational
INNER JOIN unit 
    ON operational.unit_id = unit.id
INNER JOIN market_agent 
    ON unit.market_agent_id = market_agent.id
INNER JOIN configuration 
    ON operational.configuration_id = configuration.id
INNER JOIN program
    ON configuration.program_id = program.id
WHERE market_agent.name = 'IBEG'
SQL query with INNER JOIN for operational information

Share DB access

If you work in a team, you will need to share the information with them. How can you give them access to the information?

Credentials

By passing them the DB credentials and showing them how to connect through an SQL client, such as Python.

from sqlalchemy import create_engine
engine = create_engine(f'postgresql://{USER}:{PASSWORD}@{URL}:{PORT}/{DATABASENAME}')
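
To avoid sharing the password in plain text inside the code, the credentials can also be read from environment variables. Here is a minimal sketch; the variable names are our own choice.

import os
from sqlalchemy import create_engine

# Read the credentials from the environment instead of hardcoding them
USER = os.environ['DB_USER']
PASSWORD = os.environ['DB_PASSWORD']
URL = os.environ['DB_HOST']
PORT = os.environ['DB_PORT']
DATABASENAME = os.environ['DB_NAME']

engine = create_engine(f'postgresql://{USER}:{PASSWORD}@{URL}:{PORT}/{DATABASENAME}')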

Query DB with Pandas in Python

They can then query the DB with the pandas library: you define the query and execute it with the pd.read_sql function, connecting to the DB through engine.

query = """
SELECT
   unit.*
FROM
   unit
INNER JOIN
   market_agent ON unit.market_agent_id = market_agent.id
WHERE
   market_agent.name = 'IBEG'
"""

pd.read_sql(query, engine)

What if my colleagues do not handle Python?

Other programs to query DB

The DB is hosted at a URL that anyone with the credentials can access. Therefore, many programs can connect to the DB and execute queries given those credentials, such as DBeaver, pgAdmin, Power BI, or Excel.

Improvements

As in any company, not everyone should have access to the database. Therefore, we need to protect the information and ensure that only authorized users can access it. Here are some of the improvements we can apply.

  • User and permission management (see the sketch after this list)
  • Restricted connection through VPN
  • Continuous execution to update the DB
  • API to query data without needing to know SQL
  • Migration management when modifying DB schema
  • Schema optimization for faster queries
  • DataFrame manipulation with faster libraries
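
As an example of the first point, here is a minimal sketch of creating a read-only user in PostgreSQL from Python; the role name, password, and database name are placeholders.

from sqlalchemy import text

# Placeholder role, password, and database name; run as an admin user
with engine.begin() as conn:
    conn.execute(text("CREATE ROLE analyst LOGIN PASSWORD 'change-me'"))
    conn.execute(text('GRANT CONNECT ON DATABASE i90 TO analyst'))
    conn.execute(text('GRANT USAGE ON SCHEMA public TO analyst'))
    conn.execute(text('GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst'))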

With programming, anything is possible. You just need a technology partner to help you develop your custom system.

Conclusions

  1. Centralized and Updated Access: All historical information from REE is in one place, eliminating the need to send Excels by email that get lost or become outdated.
  2. Speed and Efficiency: Queries are performed in seconds, even for large volumes of data, without relying on manual processes.
  3. Integration with Advanced Tools: Compatible with Python, Power BI, Tableau, and other tools, allowing analysis, visualization, machine learning, and artificial intelligence.
  4. Centralized Collaboration: Facilitates teamwork by ensuring uniform and secure access to information, without duplication or unsynchronized version issues.
  5. Automation and Error Reduction: Automates repetitive tasks like cleaning, normalization, and data insertion, optimizing time and minimizing human errors.
  6. Scalability and Flexibility: The system grows with business needs, allowing resources to be expanded without interruptions or limitations.
  7. Preparation for Advanced Analysis: Offers an ideal structure to implement predictive techniques and advanced algorithms, aiding in strategic decision-making.