Parsing HTML data

Sprigo · April 11, 2015, 6:27am

I’m working on a project for the boss (wife) which will provide local tide information.

The data comes from the UK’s Hydrographic website, but they don’t provide an API, so I’m currently extracting the information with lots of string.IndexOf calls and the like. It works, and I can extract 7 days of data and the associated graphics, but I was wondering whether there’s a more elegant way to parse HTML pages?

Jason · April 11, 2015, 9:26am

@ Sprigo - I’ve done something similar for a WinForms app using Regex.

I also recommend RegexBuddy for testing your expressions:

http://www.regexbuddy.com/csharp.html
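
To make the Regex suggestion concrete, here is a minimal sketch in desktop C#. The row layout (`<tr><td>HW</td><td>03:12</td><td>4.7 m</td></tr>`), the class name `TideRegexSketch` and the method `Extract` are all invented for illustration; the real page's markup is not shown in this thread, so the pattern would need to be adapted to it.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TideRegexSketch
{
    // Hypothetical row layout (an assumption, not the real page):
    //   <tr><td>HW</td><td>03:12</td><td>4.7 m</td></tr>
    // Named groups keep the extraction readable compared with IndexOf arithmetic.
    private static readonly Regex RowPattern = new Regex(
        @"<tr>\s*<td>(?<type>HW|LW)</td>\s*" +
        @"<td>(?<time>\d{2}:\d{2})</td>\s*" +
        @"<td>(?<height>\d+(?:\.\d+)?)\s*m</td>\s*</tr>",
        RegexOptions.IgnoreCase);

    // Returns one "TYPE HH:MM HEIGHT" string per matching table row.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            // Parse with the invariant culture so "4.7" is read the same on any machine.
            double height = double.Parse(
                m.Groups["height"].Value, CultureInfo.InvariantCulture);

            results.Add(string.Format(
                CultureInfo.InvariantCulture,
                "{0} {1} {2:0.0}",
                m.Groups["type"].Value.ToUpperInvariant(),
                m.Groups["time"].Value,
                height));
        }
        return results;
    }
}
```

Calling `TideRegexSketch.Extract(pageHtml)` returns an empty list when nothing matches, which is a handy signal that the page layout has changed and the pattern needs updating.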

devhammer · April 11, 2015, 10:21am

@ Sprigo - Another thing to think about is maintainability. If this is a project that will be deployed for longer than a few weeks or months, then it might be worth building a web API wrapper around the parsing code, and have your device call that, rather than do the parsing directly.

That way, if/when the format of the page changes, you can fix the parsing in the API, rather than having to fish out your device and redeploy to it.

An ASP.NET Web API on the free tier of Azure Web Apps would work well for something like that.

Sprigo · April 11, 2015, 12:03pm

@ Jason - That looks like a useful tool. Many thanks

@ devhammer - I was also concerned about the maintainability but hadn’t thought about a web API wrapper. Many thanks for those pointers.

devhammer · April 11, 2015, 12:45pm

@ Sprigo - The sad truth is that any time you’re screen-scraping, you’re going to need to update eventually. So I figure why not make the updates behind a stable interface?

Glad you found the tip helpful.

Mr_John_Smith · April 22, 2015, 3:00am

http://systemhtml.codeplex.com or http://htmlagilitypack.codeplex.com/

SystemHtml converts a page to XML and lets you query it with XPath, but everyone uses HAP (HTML Agility Pack) anyway.
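
As a rough illustration of the "page as XML, queried with XPath" idea, the sketch below uses only the standard System.Xml classes rather than either library. It assumes the markup is already well-formed, has no default XML namespace, and contains a table with id `tides`; all of that, along with the names `TideXPathSketch` and `Extract`, is made up for the example. Real-world HTML is often not well-formed, which is the gap the libraries above fill; HTML Agility Pack's `HtmlDocument` exposes a similar `SelectNodes` call on its `DocumentNode`.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TideXPathSketch
{
    // Returns one "TYPE TIME HEIGHT" string per three-cell row of the
    // hypothetical table with id "tides".
    public static List<string> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        // LoadXml throws XmlException if the input is not well-formed XML.
        doc.LoadXml(xhtml);

        var results = new List<string>();

        // The XPath selects only rows with exactly three <td> cells,
        // so header rows made of <th> cells are skipped.
        foreach (XmlNode row in doc.SelectNodes("//table[@id='tides']/tr[count(td)=3]"))
        {
            XmlNodeList cells = row.SelectNodes("td");
            results.Add(
                cells[0].InnerText.Trim() + " " +
                cells[1].InnerText.Trim() + " " +
                cells[2].InnerText.Trim());
        }
        return results;
    }
}
```

The appeal over string searching is that the query describes the structure you want (rows of a particular table), so small changes in whitespace or attribute order on the page do not break it.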