Hi everyone, welcome to my Alteryx made easy series. My name is Quyen. I am currently a data analytics consultant at the Data School. I have just finished my first week of training in Alteryx. After my last 3 days of training, I have learned a lot of useful skills in data pre-processing using Alteryx. Therefore, I decided to share what I have learned from my wonderful coaches with you so you could make your data preparation steps easier and get it ready for deeper analysis later.
Often the data we use for analysis had been cleaned and tabulated with columns and rows. However, in reality, data is not always that clean and tidy and you have to prepare it yourself before your analysis. According to a survey from CrowdFlower, Data scientists spent 80% of their time preparing the data and 76% of them view data pre-processing as the least enjoyable job.
Before I know Alteryx, I was very scared of messy data especially the one that comes in text format having inconsistent structures with extra white spaces, tabs, and punctuations. However, during my last 3 days of training at the data school, my coaches Sean and Natalia have created some wow moments for me in how could Alteryx tackle all the tricks easily. If you do not know what Alteryx is, please download the software and have a play around or find some basic tutorials about the software before following my tutorial. If you have already known Alteryx enjoy my tutorial.
The data I am using today has .txt format from USDA website which is about crop progress and condition. I have included the link below for you. The data comes in with multi headers and after loading it in Altery it just has one column with lots of text like the picture below. Our task here is to turn that text into rows and columns that are ready for analysis later. Stay with me, I will make it easy for you to follow.
I would like to divide my data preparation into 2 main steps, the initial diagnosis of the data and data cleansing. As you could see, the data has many problems that we have to identify before we could go any further.
STEP 1 – INITIAL DIAGNOSIS
- It just has 1 column of 219 records that contain headers and values.
- There are lots of punctuation and extra white spaces. We could identify these through the thin red bar on top. The little red triangles indicate what specific rows need to be cleaned.
- There are null values, we could identify this through the yellow thin bar on top and the green one means the data is OK.
- Row 47 and row 48 contain the date. December 8, 2019, December 1, 2019, and December 8, 2019. We just need the current date which is in the third column.
- Row 5 contains the report released date December 9, 2019, which is one day after the current date of the data.
- Row 9, row 42, and row 61 contain the crop names and their progress.
- The values were recorded for each state and all state average. What I need for my analysis is the state average only. Therefore I will pay attention to row 38, 57, and 87. They all begin with a number then –States.
- From row 88+, there is just information about the report and its methodology which we do not need for our analysis. Yay, We have done the hard work, now let’s get ready for parsing our data.
STEP 2 – DATA CLEANSING
1. After downloading the data, dragging it into the Alteryx workflow with no delimiter (\0), and run it, we will have something like this in our result window.
2. Dragging the data cleansing tool under the Preparation tool pallet into our workflow and connect it to the Input tool by dropping it next to the Input icon and they will be connected automatically.
3. Generate [Crop ID] to separate each crop from another. Dragging the Multi-Row Formula under the Preparation palette to the canvas and create the ID for each crop. 1 for corn, 2 for Sunflowers, and 3 for Cotton.
IF Contains([Field_1], ” – Selected States”) THEN [Row-1:Crop ID] + 1 ELSE [Row-1:Crop ID] ENDIF
4. Generate [Row ID] with 1 for selected rows, 2 for selected Date row, and 0 for unwanted rows. Dragging another Multi-Row Formula into the canvas and follow the below picture for what to put in the configuration window.
IF Substring([Field_1],3,6) = ‘States’ OR Substring([Field_1],2,6) = ‘States’ OR Contains([Field_1], ” – Selected States”) THEN 1 ELSEIF Getword([Field_1], 0) = ‘Released’ THEN 2 ELSE 0 ENDIF
5. Filter out the unwanted rows with the Filter tool under the preparation palette. Press on the arrow with the letter T to see the data after being filtered.
6. Separate crop data and date data using the Filter tool again. By connecting the new filter tool into the T anchor of the previous filter tool. Now our data is separated into Date and Crop and they will be appended back later. The T anchor is Date data the F anchor is Crop data
7A. Parsing the [Date] from the text. Under Formula Expression type this syntax (\u\l+\s\d+),(\s\d\d\d\d),
8A. Cleansing the [RegExOut2] column as it contains extra white spaces.
9A. Combine the two RegExOut columns to create the Report Date in the right format for parsing.
10A. Converting the report date into the month dd, yyyy format so we can subtract later
11A. Using the Formula tool to subtract 1 day from the report day like what we identify before. Under Select Column drop-down click +Add Column and type Week End into the rectangle with this formula: DateTimeAdd([Date time out],-1,’days’)
7B. Connecting the Multi-Row formula into the F anchor of the filter tool to separate the column header and value. The new column will be named [Column Labels] with this formula: if [Crop ID] != [Row-1:Crop ID] then 1 else 2 endif
8B. Using the Crosstab tool under Transforming pallet to separate the crop names and values into 2 different columns.
9B. Parsing the value in column  into 7 columns using \s delimiter.
10B. Using the formula tool to get [Crop Name]: GetWord(, 0) and [Crop Progress]: GetWord(, 1) from Column 
11B. Now select our wanted columns by using the select tool under the Preparation pallet.
12. Appending the two data sets together and we have done.
Thank you for following along with me. I hope you enjoy the tutorial and please share your approach with me so I could also learn from you.
Link to Data:
Icon: <div>Icons made by <a href=”https://www.freepik.com” title=”Freepik”>Freepik</a> from <a href=”https://www.flaticon.com/” title=”Flaticon”>www.flaticon.com</a></div>