PostgreSQL Data Cleaning vs. Python Data Cleaning
In this tutorial, we will explore various techniques for cleaning data using PostgreSQL and TimescaleDB. We'll cover the following topics: 1. Adding columns together 2. Converting strings to numbers 3. Creating a view 4. Filtering rows with WHERE 5. Extracting parts of timestamps 6. Renaming values using PostgreSQL and Python 7. Filling in missing data with gap-filling functions 8. Ignoring missing data By the end of this tutorial, you should feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. You'll learn how to clean data directly within your database, which can save time in the long run compared to repetitive scripting tasks. Let's get started! 1. Adding columns together: To add two numeric columns together, you can use the + operator. For example, if you have a table called 'energy_data' with columns 'energy_reading_1' and 'energy_reading_2', you can create a new column called 'total_energy' by adding these two columns together: ```sql ALTER TABLE energy_data ADD COLUMN total_energy NUMERIC; UPDATE energy_data SET total_energy = energy_reading_1 + energy_reading_2; ``` In Python, you can use the '+' operator to add two columns together: ```python import pandas as pd # Assuming df is your DataFrame df['total_energy'] = df['energy_reading_1'] + df['energy_reading_2'] ``` 2. Converting strings to numbers: To convert a string column to a numeric column, you can use the TO_NUMBER() function in PostgreSQL or the map() function in Python. For example, if you have a table called 'energy_data' with a column called 'cost_str' containing costs as strings (e.g., '$10.50'), you can create a new numeric column called 'cost' by converting these strings to numbers: ```sql ALTER TABLE energy_data ADD COLUMN cost NUMERIC; UPDATE energy_data SET cost = TO_NUMBER(cost_str, '9999999999.99'); ``` In Python, you can use the map() function to convert a string column to a numeric column: ```python import pandas as pd # Assuming df is your DataFrame df['cost'] = df['cost_str'].map(lambda x: float(x.strip('$'))).astype(float) ``` 3. Creating a view: A view is a virtual table that contains the result set of an SQL SELECT statement. You can use views to simplify complex queries, restrict access to certain data, or encapsulate business logic. To create a view in PostgreSQL, you can use the CREATE VIEW statement: ```sql CREATE VIEW energy_view AS SELECT * FROM energy_data; ``` In Python, you can use the pandas DataFrame method to_sql() to create a view from a DataFrame: ```python import pandas as pd from sqlalchemy import create_engine # Assuming df is your DataFrame and engine is your database connection df.to_sql('energy_view', con=engine, if_exists='replace') ``` 4. Filtering rows with WHERE: The WHERE clause allows you to filter rows based on specific conditions. For example, if you want to select only the rows where 'day_of_week' is equal to 1 (Monday), you can use the following query: ```sql SELECT * FROM energy_data WHERE day_of_week = 1; ``` In Python, you can use boolean indexing with DataFrame conditions to filter rows based on specific conditions: ```python import pandas as pd # Assuming df is your DataFrame df_monday = df[df['day_of_week'] == 1] ``` 5. Extracting parts of timestamps: The EXTRACT() function allows you to extract specific parts of a timestamp, such as the hour or minute. For example, if you want to select only the rows where 'time' is between 7:00 p.m. and 8:00 p.m., you can use the following query: ```sql SELECT * FROM energy_data WHERE EXTRACT(HOUR FROM time) BETWEEN 19 AND 20; ``` In Python, you can use the pandas DataFrame method between_time() to filter rows based on specific parts of timestamps: ```python import pandas as pd # Assuming df is your DataFrame df_7to8 = df.between_time('19:00', '20:00') ``` 6. Renaming values using PostgreSQL and Python: Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on StackOverflow. The question states “How do I change a single index value in a pandas DataFrame?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple using PostgreSQL data cleaning. When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column, but would prefer it to be a 7. I can do this with the following query: ```sql SELECT type, time, usage, cost, is_weekend, -- you can use case to recode column values CASE WHEN day_of_week = 0 THEN 7 ELSE day_of_week END FROM energy_usage ``` In this case, python has similar mapping functions. ```python energy_df['day_of_week'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'}) print(energy_df.head(20)) ``` 7. Filling in missing data with gap-filling functions: Another common problem in the PostgreSQL data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points. However, it's very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon. This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing values with reasonable estimates and TimescaleDB actually has built-in functions to help you do this. For example, let's say that you are modeling energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data or try to fill in the missing values with reasonable estimations. For today, let's assume that the model I want to use would benefit more from filling in the missing values. As an example, I created some data. I called this table energy_data and it is missing both time and energy readings for the timestamps between 7:45 a.m. and 11:30 a.m. ```sql SELECT --here I specified that the data should increment by 15 mins time_bucket_gapfill('15 min', time) AS timestamp, interpolate(avg(energy)), locf(avg(energy)) FROM energy_data --to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement WHERE energy IS NOT NULL AND time > '2021-01-01 07:00:00.000' AND time < '2021-01-01 13:00:00.000' GROUP BY timestamp ORDER BY timestamp; ``` In Python, you can use the pandas DataFrame method fillna() to fill in missing values with a specific value or interpolate between existing values: ```python energy_test_df['time'] = pd.to_datetime(energy_test_df['time']) energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').fillna(method='ffill').reset_index() energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index() energy_test_df['locf'] = energy_test_df_locf['energy'] print(energy_test_df) ``` 8. Ignoring missing data: The following query shows how I could ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause to specify the times I like to ignore (the second query). ```sql SELECT * FROM energy_data WHERE energy IS NOT NULL; SELECT * FROM energy_data WHERE time <= '2021-01-01 07:45:00.000' OR time >= '2021-01-01 11:30:00.000'; ``` PostgreSQL Data Cleaning Wrap-Up After reading through these various techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you're looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB. In my next posts, I'll discuss techniques for transforming data using PostgreSQL and TimescaleDB. I'll then use everything we've learned together to benchmark data munging tasks in PostgreSQL and Python vs. pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations). If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members. If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started. Until next time!
Company
Timescale
Date published
May 23, 2024
Author(s)
Miranda Auhl
Word count
5492
Hacker News points
None found.
Language
English