Parsing HTML data

I’m working on a project for the boss (wife) which will provide local tide information.

The data comes from the UK’s Hydrographic website but they don’t provide an API, so I’m currently extracting the information using lots of string.IndexOf calls etc. Although it’s working and I am able to extract 7 days of data and associated graphics, I was wondering if there’s a more elegant solution to parsing HTML pages?

@ Sprigo - I’ve done something similar for a WinForms app using Regex.

I also recommend RegexBuddy for testing your patterns:

http://www.regexbuddy.com/csharp.html
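
As a rough illustration of the Regex approach, here is a minimal sketch. The table row layout, the `HW`/`LW` labels and the `TideRegexDemo` class are all assumptions made up for the example; the real page’s markup will differ, so the pattern would need adapting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TideRegexDemo
{
    // Hypothetical row layout: <tr><td>HW</td><td>03:12</td><td>4.6 m</td></tr>
    // Named groups keep the extraction readable compared with chained IndexOf calls.
    private static readonly Regex RowPattern = new Regex(
        @"<tr>\s*<td>(?<type>HW|LW)</td>\s*<td>(?<time>\d{2}:\d{2})</td>\s*<td>(?<height>\d+(?:\.\d+)?)\s*m</td>\s*</tr>",
        RegexOptions.IgnoreCase);

    // Returns one "type time height" string per matching table row.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            results.Add(m.Groups["type"].Value + " " +
                        m.Groups["time"].Value + " " +
                        m.Groups["height"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Sample input invented for the example.
        string html =
            "<table><tr><td>HW</td><td>03:12</td><td>4.6 m</td></tr>" +
            "<tr><td>LW</td><td>09:40</td><td>0.9 m</td></tr></table>";

        foreach (string line in Extract(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

A pattern like this is tied to the exact markup, so it breaks as soon as the site adds an attribute or reorders the cells.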

@ Sprigo - Another thing to think about is maintainability. If this is a project that will be deployed for longer than a few weeks or months, then it might be worth building a web API wrapper around the parsing code, and have your device call that, rather than do the parsing directly.

That way, if/when the format of the page changes, you can fix the parsing in the API, rather than having to fish out and redeploy to your device.

An ASP.NET Web API on the free tier of Azure Web Apps would work well for something like that.
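
To show what the device side might look like once the scraping lives behind a wrapper, here is a minimal sketch. It assumes the wrapper returns plain text with one `type,time,height` record per line; that format, and the `TideEvent` and `TideClient` names, are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class TideEvent
{
    public string Type;          // "HW" or "LW" (assumed labels)
    public string Time;          // "HH:mm" exactly as sent by the wrapper
    public double HeightMetres;
}

public static class TideClient
{
    // Parses the wrapper's (hypothetical) response body:
    // one "type,time,height" record per line, e.g. "HW,03:12,4.6".
    public static List<TideEvent> ParseResponse(string body)
    {
        var events = new List<TideEvent>();
        string[] lines = body.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawLine in lines)
        {
            string[] parts = rawLine.Trim().Split(',');
            if (parts.Length != 3)
            {
                continue; // skip blank or malformed lines
            }

            double height;
            if (!double.TryParse(parts[2], NumberStyles.Float,
                                 CultureInfo.InvariantCulture, out height))
            {
                continue; // skip records with an unreadable height
            }

            events.Add(new TideEvent
            {
                Type = parts[0],
                Time = parts[1],
                HeightMetres = height
            });
        }
        return events;
    }
}
```

The device only ever sees this small format, so a change to the source page means updating the wrapper’s parser and nothing on the device.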

@ Jason - That looks like a useful tool. Many thanks

@ devhammer - I was also concerned about the maintainability but hadn’t thought about a web API wrapper. Many thanks for those pointers.

@ Sprigo - The sad truth is that any time you’re screen-scraping, you’re going to need to update eventually. So I figure why not make the updates behind a stable interface? :slight_smile:

Glad you found the tip helpful.

http://systemhtml.codeplex.com or http://htmlagilitypack.codeplex.com/

SystemHtml converts a page to XML and allows you to query it with XPath. But everyone uses HAP (Html Agility Pack) anyway :expressionless:
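
For comparison, here is a minimal sketch of the XPath approach using Html Agility Pack. It needs the HtmlAgilityPack package, and the `tides` table id and the `TideHapDemo` class are assumptions for the example rather than anything taken from the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external package: HtmlAgilityPack

public static class TideHapDemo
{
    // Returns the trimmed cell text of every data row in the
    // (hypothetical) table with id="tides".
    public static List<string[]> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes =
            doc.DocumentNode.SelectNodes("//table[@id='tides']//tr");
        if (trNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trNodes)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./td");
            if (cells == null)
            {
                continue; // e.g. a header row that only contains <th> cells
            }

            var values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
            {
                // Decode entities such as &nbsp; before trimming.
                values[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

Because the query targets the document structure rather than the raw text, changes to whitespace or attribute order are less likely to break it than with IndexOf or Regex, though a change to the table itself still will.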