Tomorrow, we have our first official SQL training at the Data School. Therefore, I thought it would be good practice for me to write a blog introducing the basics of Structured Query Language (SQL) as a refresher. There are a few types of SQL from PostgresSQL, MySQL, Oracle etc.

Before I became a Data Analyst, I looked at job postings and blogs, and subreddits, and a theme began to emerge. SQL seemed to be pretty important. I took a course and passed the SQL Fundamentals via Datacamp and managed to grasp most of the concepts. It is not that complicated really.

 

The Foundations

All SQL queries follow roughly the same structure at their base.

SELECT
FROM

Select is what you want and from is where you want it from.

Example:

SELECT *
FROM `bigquery-public-data.new_york_trees.tree_species`

This returns all the fields from the selected database about trees species in New York. The ” * ” represents the wildcard ‘all’.

If I only want to return certain fields I would include them in my SELECT statement.

SELECT species_common_name,fall_color, tree_size
FROM `bigquery-public-data.new_york_trees.tree_species`
Would return only the following fields:

 

WHERE?

The WHERE command is when we start to filter data from the database table.

e.g.

SELECT species_common_name,fall_color, tree_size
FROM `bigquery-public-data.new_york_trees.tree_species`
WHERE tree_size= 'Large (Mature Height > 50 ft)'
Will return the results that are only large trees (not medium or small etc.) This can also be combined with AND / OR to add further conditions to the filter. So lets add Yellow to the filters like so.
SELECT species_common_name,fall_color, tree_size
 FROM `bigquery-public-data.new_york_trees.tree_species`
WHERE tree_size= 'Large (Mature Height > 50 ft)' AND fall_color= 'Yellow'

To add to the functionality of the WHERE clause you can use OR which returns one value OR another (Surprise surprise). Lastly you can add a NOT command to the query to select all records except the ones that match the condition. Like below.
SELECT species_common_name,fall_color, tree_size
 FROM `bigquery-public-data.new_york_trees.tree_species`
WHERE NOT tree_size= 'Large (Mature Height > 50 ft)' AND fall_color= 'Yellow'
This would return all records for the selected columns where the colour is Yellow and the tree size is anything but Large.

 

GROUP BY, ORDER BY

Group By = This let’s you summarise the information and you get a count of how many records fit in each group.
SELECT COUNT(fall_color), species_common_name
FROM `bigquery-public-data.new_york_trees.tree_species`
GROUP BY species_common_name
This query counts the different colours of trees and groups them by their common name.
Order By = Does would it says on the tin. Orders the data in a user defined way.
SELECT  species_common_name
FROM `bigquery-public-data.new_york_trees.tree_species`
ORDER BY species_common_name desc
So, this is ordering the species names in a descending order. If there is no desc it defaults to ascending alphabetical order.

 

HAVING

This is used when there are aggregate functions and can’t use the WHERE clause anymore due to the row-by-row nature of the WHERE filters.

SELECT  COUNT(growth_rate), growth_rate
FROM `bigquery-public-data.new_york_trees.tree_species`
GROUP BY  growth_rate
HAVING COUNT(growth_rate)>20
So, count(growth_rate) is the aggregate and having filters by those above 20. Which leaves us with only medium growth rate trees.
So, there you have it a crash course in basic SQL. I should at least remember some of it for tomorrow now.

 

The Data School
Author: The Data School