The Curious Dev

Various programming sidetracks, devops detours and other shiny objects

Jan 20, 2013 - 4 minute read - Groovy Regex

Simple Data Collectors

The city I live in, Perth, sprawls over a large area, so the Bureau of Meteorology (BOM) temperature for Perth isn’t entirely accurate for where I am. Enter the Ag Dept’s temperature sensors, or more specifically, the one for Wanneroo. These sensors are updated every minute (most of the time) and give a more accurate measure of what the weather is doing right now. Rather than visiting the page manually, I wrote a script to grab the temperature for use elsewhere, such as on a dashboard.

I’ve previously written something similar that extracts the Perth value from the BOM, which delightfully provides a JSON feed. So I rolled the two together, and maybe I’ll use them both in the future, perhaps to show how much more effective our afternoon sea breeze is the closer one is to the coast :)

The BOM script

This script simply grabs the text at the given URL and parses it with JsonSlurper, which has shipped with Groovy since 1.8 (note the explicit import).

//ExtractTemperatures.groovy
import groovy.json.*

class ExtractTemperatures {
	def processBOM(Expando site) {
		//create a URL and get the text from it
		def content = new URL(site.source).text

		//create a new JsonSlurper and parse the JSON data
		def jsonSlurper = new JsonSlurper()
		def jsonData = jsonSlurper.parseText(content)
		
		//extract just the entry we want
		def latestEntry = jsonData.observations.data[0]
		
		println "\n==== BOM ${site.name} ===="
		println "airTemp = ${latestEntry.air_temp}"
		println "windSpeed = ${latestEntry.wind_spd_kmh}"
		println "windDirection = ${latestEntry.wind_dir}"
		
		return latestEntry.air_temp
	}
	
	//snip...
}

The BOM provide JSON feeds for most of their data; the links sit at the bottom of each station-specific page, Perth’s included.
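To see the shape the script relies on without hitting the network, here is a minimal sketch that parses an inline sample. The payload and its values are invented for illustration; only the observations.data path and the air_temp, wind_spd_kmh and wind_dir field names come from the script above.

```groovy
import groovy.json.JsonSlurper

// Hypothetical, trimmed-down sample in the shape the script above expects.
// The values are made up; a real feed carries many more fields per entry.
def sampleJson = '''
{
  "observations": {
    "data": [
      { "air_temp": 31.4, "wind_spd_kmh": 22, "wind_dir": "SW" },
      { "air_temp": 32.0, "wind_spd_kmh": 19, "wind_dir": "SSW" }
    ]
  }
}
'''

def jsonData = new JsonSlurper().parseText(sampleJson)

// The script above treats the first entry as the latest observation
def latestEntry = jsonData.observations.data[0]

println "airTemp = ${latestEntry.air_temp}"
println "windSpeed = ${latestEntry.wind_spd_kmh}"
println "windDirection = ${latestEntry.wind_dir}"
```

JsonSlurper hands back plain maps and lists, which is why the dotted property access and the [0] index work without any extra classes.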

The Ag. Dept. script

The WA Ag. Dept. have many stations, but I’m only really interested in the Wanneroo one as it is quite close to my location. The script simply grabs the text from the specified page and then extracts the Temperature, Wind Speed and Wind Direction with six specific regex queries.

//ExtractTemperatures.groovy
import groovy.json.*

class ExtractTemperatures {
	//snip...
	
	def processAgDeptWithRegex(Expando site) {
		//generic URL values
		String agDeptPrefix = "http://agspsrv34.agric.wa.gov.au/climate/livedata/"
		String agDeptsuffix = "webpag.htm"
		
		//retrieve web page text for the particular site
		String sourceText = new URL(agDeptPrefix + site.source + agDeptsuffix).text
		
		//all the values of interest sit in "Yellow" formatted cells, so build the patterns from shared parts.
		//(?:<\/font>)? makes the whole closing font tag optional ([<\/font>]? would only match a single character)
		String cellStart = /<td width="68"><font face="Courier New"[\s]?color="yellow">[\s]?/
		String cellEnd = /(?:<\/font>)?<\/td>/
		String decimalValue = /(?:[0-9]+\.[0-9]?)/
		String compassValue = /(?:[ENSW]+)[\s]?/

		//extract all numeric "Yellow" cells once: the first one (which we know is the correct one) is Temperature, the sixth is Wind Speed.
		List<String> numericCells = sourceText.findAll(cellStart + decimalValue + cellEnd)
		String currentTemp = numericCells[0].findAll(/[0-9]+\.[0-9]/)[0] //just get the "decimal" element
		String currentWindSpeed = numericCells[5].findAll(/[0-9]+\.[0-9]/)[0] //just get the "decimal" element

		//extract all "Yellow" cells with ENSW variations in them, then just get the first one for Wind Direction.
		String windDirectionElement = sourceText.findAll(cellStart + compassValue + cellEnd)[0]
		String currentWindDirection = windDirectionElement.findAll(/(?!=ow">[\s]?)[ENSW]+[\s]?(?=<\/)/)[0] //just get the element value
		
		println "\n==== Ag. Dept. ${site.name} ===="
		println "airTemp = ${currentTemp}"
		println "windSpeed = ${currentWindSpeed}"
		println "windDirection = ${currentWindDirection}"

		return currentTemp
	}
}

Regex can feel like black magic at times, and writing this script was aided greatly by RegexBuddy, well worth the small purchase price of $40ish. No doubt those huge regexes that pick out the exact line in the sourceText could be a little tighter; they’re long mainly because they match exact strings, but they work and are not noticeably slow, so I’m happy.
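As a sketch of what “a little tighter” might look like, a capture group can pull the value out in the same pass that finds the cell, so the second findAll is no longer needed. The sample rows below are hypothetical, modelled on the patterns above rather than copied from the real page, and the variable names are my own.

```groovy
import java.util.regex.Pattern

// Hypothetical sample rows modelled on the patterns used above;
// the real page's markup and values may differ.
String sampleHtml = '''
<td width="68"><font face="Courier New" color="yellow">28.6</td>
<td width="68"><font face="Courier New" color="yellow">SSW</td>
<td width="68"><font face="Courier New" color="yellow">14.2</td>
'''

// The part every "yellow" cell has in common
String cellStart = /<td width="68"><font face="Courier New"\s?color="yellow">\s?/

// Capture groups (the parentheses) mark the value to keep
Pattern decimalCell = Pattern.compile(cellStart + '([0-9]+\\.[0-9])')
Pattern compassCell = Pattern.compile(cellStart + '([ENSW]+)')

// With one group in the pattern, the closure receives the full match and the group
List<String> decimals = sampleHtml.findAll(decimalCell) { full, value -> value }
List<String> directions = sampleHtml.findAll(compassCell) { full, value -> value }

println "decimals = ${decimals}"
println "directions = ${directions}"
```

Picking the temperature or wind speed is then just a matter of indexing into decimals, with the same caveat as before: the position of each value depends on the page layout staying put.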

Some regex tips from the above code:

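  • [\s]? optionally match one whitespace character (a space, tab or newline)
  • [0-9]+ match one or more digits
  • [0-9]? optionally match a single digit, i.e. zero or one. For zero to many use [0-9]* instead; [0-9]+? is something else again, a lazy “one or more” that takes as few digits as it can.
  • (?!=ow">[\s]?) is parsed as a negative lookahead, (?!...), for the literal text =ow">, so on this page it never excludes anything and does not anchor the match. To start a match right after the text ow"> (as in color="yellow"), with an optional space, the construct to use is a positive lookbehind: (?<=ow">[\s]?). The wrapping ( and ) are the important bits here.
  • [ENSW]+ match one or more characters from ‘E’, ‘N’, ‘S’ or ‘W’, allowing for combinations like SSW or SW or just E.
  • (?=<\/) positive lookahead: the match must be followed by the text </ (as in a closing tag after the data value), but that text is not included in the result. As with the lookbehind above, the wrapping ( and ) are important here.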
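The constructs in this list can be checked in isolation with a few assertions. This is only an illustrative sketch: the sample strings are made up, and the lookbehind form (?<=...) is shown alongside the negative lookahead (?!...) so the two can be compared.

```groovy
// \s? : zero or one whitespace character
assert 'a b' ==~ /a\s?b/
assert 'ab' ==~ /a\s?b/

// [0-9]+ is greedy: at least one digit, and as many as possible
assert 'temp 285'.find(/[0-9]+/) == '285'

// [0-9]+? is lazy: still at least one digit, but as few as possible
assert 'temp 285'.find(/[0-9]+?/) == '2'

// [0-9]? : zero or one digit after the decimal point
assert '28.' ==~ /[0-9]+\.[0-9]?/
assert '28.6' ==~ /[0-9]+\.[0-9]?/
assert !('28.65' ==~ /[0-9]+\.[0-9]?/)

// (?!=) is a negative lookahead: "not followed by a literal ="
assert 'ab'.find(/a(?!=)/) == 'a'
assert 'a=b'.find(/a(?!=)/) == null

// (?<=...) looks behind and (?=...) looks ahead; neither is part of the result
String cell = 'color="yellow"> SSW </td>'
assert cell.find(/(?<=ow">\s?)[ENSW]+(?=\s?<\/)/) == 'SSW'

println 'all regex checks passed'
```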

Bringing it all together

Finally, I’ve written a little script that pulls these two functions together and runs them in a thread with an infinite loop that sleeps for 60 seconds after each pass, so that we keep getting this data rather than fetching it just once.

//RunWeatherCollectors.groovy
g = new ExtractTemperatures()

def th = Thread.start {
	while (true) {
		println "\n\n ${new Date().toString()}"
		
		def bomPerth = new Expando(source: "http://www.bom.gov.au/fwo/IDW60901/IDW60901.94608.json", name: "Perth")
		g.processBOM(bomPerth)
	
		def agWanneroo = new Expando(source: "wn", name: "Wanneroo")
		g.processAgDeptWithRegex(agWanneroo)
		
		sleep 60000
	}
}
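One thing the loop does not guard against is a failed fetch: new URL(...).text throws if the site is unreachable, and an uncaught exception ends the thread. Below is a minimal sketch of one way to keep the loop alive, using a hypothetical safely helper that is not part of the original code; the closures are stand-ins for the real collector calls.

```groovy
// Hypothetical helper: run one collector and report (rather than propagate)
// any failure, so a single bad fetch does not end the polling thread.
def safely(String label, Closure collector) {
    try {
        return collector()
    } catch (Exception e) {
        println "${label} failed: ${e.message}"
        return null
    }
}

// Stand-ins for g.processBOM(...) and g.processAgDeptWithRegex(...)
def good = safely('BOM Perth') { 31.4 }
def bad = safely('Ag. Dept. Wanneroo') { throw new IOException('connection timed out') }

println "good = ${good}, bad = ${bad}"
```

Inside the while loop the calls would then read safely('BOM Perth') { g.processBOM(bomPerth) } and likewise for Wanneroo, with a null result meaning that reading was skipped for this pass.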

Code

The above code examples are here