SQL basics
SQL Count
The Count function in SQL is used to count the number of records in a table returned in a query. It is an aggregate function which means it allows you to perform a calculation on a set of values to return a single scalar value- in this case, the count. Understanding the statistics behind data is key to success in any business- hence SQL Count may easily be one of the most frequently used SQL commands. If you are the owner of a private school for example you’d like to know how your school is doing by knowing your numbers well- how many students were enrolled in the advanced STEM classes this year, how many kids did well in the PE classes or how many struggled to keep their grades above average. SQL Count allows us to perform a multitude of such simple and complex operations.
I. SQL Count() Function
The SQL Count() function is commonly used within a SELECT statement with the default asterisk parameter (*). This counts all the columns for all records returned including all NULL and duplicate values.
Let’s assume we have a School Table with StudentId as the primary key uniquely identifying all students of a high school. This table has a column with name STEM that represents the students who are taking the STEM classes.
SELECT COUNT (*) FROM School
This will return a count of all records from the school table including any NULL student records.
To avoid NULL values which is often the case (valid old student records when STEM was not a part of the curriculum may show that column entry as NULL), the (ALL) parameter can be used.
SELECT COUNT(All)
FROM School
This will return a count of all the rows with non-NULL values from the same table.
II. SQL Count Distinct
For unique and non-NULL values the SQL Count Distinct clause is used where we need to specify the column name of the table for which you seek distinct values.
SELECT COUNT (Distinct STEMId) As “Number of STEM classes”
FROM School
This will result the number of unique STEM Sections offered in the school and will return the count under a heading “Number of STEM classes”.
Often our business requirements or functionalities are much more complex. For example for our school we may wish to count the number of STEM classes for students in class IX who obtained a Math score of 90 and above (assuming STEM classes are optional or offered on a per student basis). SQL Count allows the use of expressions to address such conditional clauses.
SELECT COUNT (Distinct StemId)
FROM School
WHERE ClassId=’IX’ and MathScore>=90
Here we again use SQL Count Distinct to make sure there are no duplicate values but put a WHERE clause to define the conditions for the output we need.
III. SQL Count with other Aggregate functions
Aggregate functions like SQL Max or SQL Sum can be used in addition to SQL Count to calculate more complex values. For example we may like to know the STEM class in which the grade IX students enrolled the most. In such a case we would count the no of students enrolled in each unique STEM class and then find the one with the Max.
SELECT MAX(mycount)
FROM (
SELECT StemId, COUNT (StemId) mycount
FROM School
WHERE ClassId=’IX’
GROUP BY StemId
)
SQL Where
The SQL Where Clause is used to return a list of records after applying a filter on rows and tables that fulfils a specific condition. It is used in combination with the SELECT statement. The condition in the Where Clause is usually an expression which is applied to table rows and columns and evaluated to True. Comparing a particular column value to a literal or searching for only non-Null records are commonly used search conditions.
Let’s take a student records database of an elementary school for example where we want to filter our search for kids who joined the school in 2018.
SELECT StudentName
FROM SCHOOL
WHERE YearJoined=2018
In this case the database queries all the rows of the Database table SCHOOL for the column name Studentname for which the YearJoined column value is 2018 and returns the particular list of records. We can replace the above condition by another one which searches for students whose age is less than 6 years. In that case we shall use the ‘<’ operator in the WHERE statement and obtain the needed results.
SELECT StudentName
FROM SCHOOL
WHERE Age<6
I. Multiple conditions- inclusive and exclusive
For filters more complex than these the SQL Where clause also allows the use of multiple conditions joined together using logical operators like AND and OR as in the SQL query below.
SELECT StudentName
FROM SCHOOL
WHERE YearJoined=2018 AND Age<6
This query finds rows from the same table now searching for Student names where both the conditions ‘YearJoined=2018’ AND ‘Age<6’ are met.
In the above examples we have searched for conditions that include a set of rows in the results list. It is also possible however to use expressions which can be used to exclude a number of rows matching a certain condition. For example may wish to find all students in the school except those whose parents are US citizens. The NOT operator allows this to happen.
SELECT StudentName
FROM SCHOOL
WHERE USCitizenParent NOT True
II. SQL Order of evaluation
Just like in simple logic any combination of AND, OR or NOT (or any other computational logical operators) can also be used for more complex filters and conditions. What is important to remember is that the mathematical order of operation evaluations for Computational Logic is still valid here and must be kept in mind when formulating a query. For example NOT operation is always evaluated before the AND and OR operation. Our query may end up in an error or not produce the desired list of records if we are not cautious about it. Also if there is no seemingly logical order, the database will compute the WHERE clause expressions in the order in which they are found- first one first. It is hence necessary to put the specific conditions before the general conditions like in the following example both for accuracy as well as efficiency.
SELECT StudentName
FROM SCHOOL
WHERE Lastname=’Thomson’ AND ParentIsUSCitizen=’True’
In this query the expression Lastname is evaluated and a temporary list of records meeting the condition is created. The database then evaluates the second condition ParentUSCitizenship as True in the temporary list of records. In the first pass the database has to scan the whole database but in the second pass it only looks at the already filtered list of records. Knowing how to order the query conditions can thus drastically reduce the operational burden on the database management system.
III. Range operators IN, BETWEEN
Sometimes we may like to find a set of values that lie within a certain range or where the value is one from a list of values. Operators IN and BETWEEN can be used for such cases within the Where Clause.
SELECT StudentName
FROM SCHOOL
WHERE LastName NOT IN
(Travolta, Edgar, Woods, Thompson)
This query searches all students whose last name is not any of ‘Travolta’, ‘Edgar’, ‘Woods’ or ‘Thompson’.
SELECT StudentName
FROM SCHOOL
WHERE Age BETWEEN 5 and 6
This query searches for all students whose age lies between 5 and 6 years.
IV. Pattern matching with LIKE
We can also evaluate an expression within a WHERE clause to match a certain pattern. Like if we want to find all students whose lastnames start with ‘Th’ such as in ‘Thames’ or ‘Thompson’ or ‘Tykes’ etc we can use the following query.
SELECT StudentName
FROM SCHOOL
WHERE LastName LIKE ‘Th%’
SQL Order By
SQL Order by clause is used to sort the rows returned in a results query. By default a SQL query returns results in no particular order or in the order in which they are found in the database. Sorting the query results not only makes them useful in a number of ways but also allows for further database functions to be performed on them more efficiently and effectively. Finding a scalar valued result like a maximum or minimum or the first value in a list of values is a frequently required functionality. This can be achieved by performing aggregate functions like SQL Max(), SQL Last() etc over the returned queried list of values.
For example you may be thinking of settling in US and may be browsing online to find articles that list the recent most best places to live. If I were you I would be wondering about the warmest cities in the east coast. If we had access to some public government datasets for city data we could create a SQL query to find out that answer by ourselves. The first thing we would need is the list of all cities in the east coast. Next, having them sorted by average monthly temperature would make the task of finding the warmest city really easy.
SQL Order clause by is always used after the SELECT, FROM and WHERE statements in a SQL query. To sort by a certain parameter all we need to do is specify the column name.
Select City, AvgTemp
From CITYDATA
Where Timezone=EST
The above query will return a list of cities in the east coast with two columns- city names and average temperatures in no particular order. We can sort the results by using Order By clause in the end as follows.
Select City, AvgTemp
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp
This query will show records sorted by AvgTemp in ascending order.
I. SQL Sorting Order
The default sorting order of the Order by clause is Ascending. To sort in descending order, we need to specify DESC term in the clause.
Select City, AvgTemp
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp DESC
The above query will result in showing a list of cities with the highest average temperature at the top.
We now may also want to look for cities with not only warmer temperatures but also with cheap rents. In this case we need to sort the list again by a new parameter-Rentvalue. Order by allows us to do multiple sorting operations in sequence which means we can sort a list already sorted by one parameter again by another parameter in the same or a different order.
Select City, AvgTemp, Rentvalue
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp DESC, Rentvalue
Note above that we did not specify any order for sorting by Rentvalue because we want to see the cities with the lowest rents and ascending order is already the default. It is also important to note that the columns we use for sorting need not be present in the list of records.
II. SQL Order by with Aggregate functions
Now all we want to know is which is the best city for living, we can rewrite the SELECT statement as just SELECT City and use the aggregate function SQL First with Order by to find the top city to live in US.
Select First (City)
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp DESC, Rentvalue
If like before we just wanted to find the city with the warmest temperature, we could use the function Max instead in the SQL Order by clause.
Select Max (AvgTemp)
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp DESC
III. Window functions
What we really did above was to find the best one for us to live in the country. It would be nicer if we could rank a list of cities by the same criteria so we could for example pick the top five and take our time to decide on one. SQL already has functions that allow us to rank results given a certain criteria. Rank(). Dense_rank() and Row number() are window functions that can be used to rank rows based on the ordering of rows by a certain condition. SQL Order by is used mandatorily with the Rank() function because it needs the results sorted in a certain order.
Select City, Rank() OVER (Order by Avgtemp DESC)
From CITYDATA
Where Timezone=EST and Month BETWEEN Oct AND March
Order By AvgTemp DESC
This will result in a list of records that give a list of cities as well as their ranks based on avg temperature in descending order. If two cities have the same rank, say 2nd , the rank() function gives both records rank ‘2’ and then skips the rank ‘3’ completely for any city which means the next city will have the rank ‘4’. Function Dense_rank() can be used if we don’t wish to skip ranks like that.
Row_number() is another ranking function very similar to Rank() except that does not allow sharing of ranks between rows. For example in the above case even if two cities had the same rank ‘2’ given by function rank(), row_number() will give rank ‘2’ and ‘3’ to these cities. The syntax of Row_number and Dense() are exactly the same as Rank().
SQL Group By
SQL Group By is used for analyzing characteristics of groups of data in a table that are similar in some way. An example can be city data or demographic information identifying a group of people by some common characteristics like ethnicity, age, etc. We may also like to perform further operations on that data so it can give us more crystalized information like Average() or Sum(). Any operation like this needs a Group By clause at the end of the SQL SELECT statement.
It may be helpful to visualize the result of a Group by operation as rolling up the rows of data identified by a group into one single row. It’s like all data from the individual rows is now packed into a single row and represents a single data instance. Let’s understand this better by exploring an example.
Let’s assume we have a database that contains city data for all cities for the state of VA. To see county level information like the address of the main county office, Group By can be used to group the data from the CityData table first by city and then by county. Having the OfficeAddress column in the SELECT statement in addition to City and County helps access that information.
SELECT City, County, OfficeAddress
FROM CityData
Group By City, County
City | County | OfficeAddress(ResidentId) |
---|---|---|
Roanoke | Washington | 200 Main St, Suite #100, Roanoke, VA 24001 |
Roanoke | Warrenton | 5040 Hemington St, Roanoke, VA 24095 |
Charlottesville | Albemarle | 12 Light Water St, Charlottesville VA 22911 |
Charlottesville | Charles | 20 Berkeley Ct, First Floor, Charlottesville VA 22900 |
I. Group operations
Mathematical functions performed on sets of table rows to return a single scalar result per set are called group functions or aggregate functions like AVG, COUNT etc. Group By clause is often used with aggregate functions to perform such operations.
The following SELECT statement returns a query from table City of a Citywide database grouping people by ethnicity counting the no of people per ethinicity.
SELECT Ethnicity, Count(ResidentId)
FROM City
Group By Ethicity
If there are 2000 hispanics and 4500 American Indians for example, the results query will show the result as below.
Ethnicity | Count(ResidentId) |
---|---|
Hispanic | 2000 |
American Indian | 4500 |
Latino | 150 |
II. Rollup operator in Group By
Using the Rollup operator within Group By clause allows multiple levels of aggregation within a single operation with the levels in a hierarchy. In the first example we performed a Group By operation both by city and county to see county level information. On applying the Rollup operator we will get aggregated results performed on each level of hierarchy.
SELECT City, County, Sum(ResidentId)
FROM CityData
Group By(City, County) With Rollup
City | County | Sum(ResidentId) |
---|---|---|
NULL | NULL | 47609 |
Roanoke | NULL | 24000 |
Roanoke | Washington | 4000 |
Roanoke | Warrenton | 20000 |
Charlottesville | NULL | 23609 |
Charlottesville | Charles | 16566 |
Charlottesville | Albemarle | 7043 |
The Rollup operator is a window function in SQL and extremely efficient providing answers at multiple levels. It’s result is similar to the union of results of the same SELECT statement with no grouping, Group By City, Group By County and Group By City, County.
III. Using the Having Clause
The Having Clause is used to further filter the datasets returned by Group By by adding any additional condition/expressions. It’s similar to the WHERE clause except that it is used on an already filtered and returned row-set instead of the individual datasets of a table. It is used after the Group By clause in a SELEC statement.
The following query returns a dataset after grouping data in the table City by ethnicity where the no of people per ethinicity is at least 201 (excluding Latinos in this case since the count was 150 as in the above example).
SELECT Ethnicity
FROM City
Group By Ethicity
Having Count(ResidentId)>200
Ethnicity |
---|
Hispanic |
American Indian |