Splitting URLs within Data Cubes

Hi,

I’m working with URLs and splitting them to find the parts we need. I asked about this a while ago and made some progress, but I’m stuck again.

I’m looking to strip out a variable number of characters that come before a certain web address.

Essentially the current web address I’m trying to split is: https://blah.website.com/segment1/segment2/segment3/unusableinformation1/unusableinformation2/

However, instead of just https://blah.website there could be more than one string of text and periods before website.com

They range from blah.website.com to blah.blah.website.com to blah.blah.blah.website.com

I’d like to make 5 separate URLs from this:

  1. website.com/segment1/
  2. website.com/segment1/segment2/
  3. https://website.com/
  4. https://website.com/segment1/
  5. https://website.com/segment1/segment2/

This is currently the code I’m working with:

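string url = $URL$;
// Declared outside the try block so it is still in scope at the return.
string topURL = "#VALUE!";
try {
    // Splitting "https://host/segment1/..." on '/' gives:
    // [0] = "https:", [1] = "", [2] = host, [3] = segment1, [4] = segment2, ...
    string[] urlSegments = url.Split('/');
    if (urlSegments.Length > 4) {
        topURL = urlSegments[2] + "/" + urlSegments[3] + "/";
    }
} catch { }
return topURL;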

I would need to split not only at the ‘/’ level but also at the ‘.’ level, while taking into account the varying number of “blahs” before website.com.

Normally in C# you’d use the Uri or UriBuilder classes to solve this sort of thing, but our Data Cube transforms use a special script language that doesn’t appear to have these particular classes available (I will discuss possibly getting this added in the future).
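For reference only, here is a minimal sketch of what the Uri approach might look like in regular C#, outside the Data Cube script. The class and method names (UriSketch, HostAndSegments) are made up for illustration, and this will not help inside the transform if Uri is unavailable there.

```csharp
using System;

public static class UriSketch
{
    // Hypothetical helper: returns { scheme, host, first segment, second segment }
    // for an absolute URL. Missing segments come back as empty strings.
    public static string[] HostAndSegments(string url)
    {
        Uri uri = new Uri(url);
        // For a path like "/segment1/segment2/", uri.Segments is
        // { "/", "segment1/", "segment2/" }, so index 0 is skipped.
        string first = uri.Segments.Length > 1 ? uri.Segments[1] : "";
        string second = uri.Segments.Length > 2 ? uri.Segments[2] : "";
        return new[] { uri.Scheme, uri.Host, first, second };
    }
}
```

Note that uri.Host still contains every "blah." prefix, so the subdomains would need to be trimmed separately either way.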

However, at this time I think you can accomplish what you want via regular expressions instead. This particular Stack Overflow post has a good Regex that can be used, and here’s more information on using Regex in C# (which is how you’d want to use it in the Data Cube).

As a simple hands-on example, I successfully used the following script in the Data Cube calculated element transform to get the main host part of the URL as a new column:

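// General URI pattern: group 2 = scheme, group 4 = authority (host), group 5 = path
Regex abc = new Regex(@"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?");
Match match = abc.Match("https://blah.blah.blah.website.com/");
if (match.Success) {
    // Group 4 is the host, here "blah.blah.blah.website.com"
    return match.Groups[4].Value;
}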
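
Building on that, here is a rough sketch of how the five URLs from the question might be assembled by splitting the host on ‘.’ and the path on ‘/’. It is written as plain C#; the names UrlSplitSketch and BuildUrls are made up for illustration, and I have not confirmed that every call (for example StringSplitOptions) is available in the Data Cube script. It also assumes the wanted domain is always the last two labels of the host (so it would not handle something like website.co.uk) and that the URL has no port or user info.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlSplitSketch
{
    // Hypothetical helper: builds the five requested URLs, or returns null
    // when the input has no usable host or fewer than two path segments.
    public static string[] BuildUrls(string url)
    {
        Regex parts = new Regex(@"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?");
        Match match = parts.Match(url);
        if (!match.Success) return null;

        string scheme = match.Groups[2].Value; // e.g. "https"
        string host = match.Groups[4].Value;   // e.g. "blah.blah.website.com"
        string path = match.Groups[5].Value;   // e.g. "/segment1/segment2/..."

        // Assumption: the wanted domain is the last two dot-separated labels.
        string[] labels = host.Split('.');
        if (labels.Length < 2) return null;
        string domain = labels[labels.Length - 2] + "." + labels[labels.Length - 1];

        // Drop the empty entries produced by the leading and trailing '/'.
        string[] segments = path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (segments.Length < 2) return null;

        string one = domain + "/" + segments[0] + "/";
        string two = one + segments[1] + "/";
        string root = scheme + "://" + domain + "/";
        return new[] { one, two, root, scheme + "://" + one, scheme + "://" + two };
    }
}
```

In a calculated element you would probably inline the body, pass in $URL$ instead of a hard-coded string, and return just the one entry you need for each new column.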