Abstract
The GPS data of all stops (bus/tram/train etc.) which legally fall under the jurisdiction of VRN region, are an important piece of ingredient for a sound and robust implementation of off-line checkout faculty within the VRN-eTarif app. To facilitate the implementation rnv agreed to provide SevenRE biannually a list of stops that they procure from the VRN. A sample of such data was provided to us. In the following we do an analysis of that data and make some qualitative observation about the data.

Parsing the Excel data
Loading our analysis package and parsing the excel data to generate a JSON data.
SetDirectory[NotebookDirectory[]];
<< "VRNParseSync.wl"
JSONfilename =
ParseStopsVRN[
FileNameJoin[{ParentDirectory[NotebookDirectory[]],
"VRN-Haltestellen_gefilter MA und LU.xls"}], "Output" -> "JSON",
"JSON" -> "Short"
];
Get an optimized dataset from the JSON so efficient searches can be performed. Below is the meta data extracted from the given excel file.
xlsData = (Dataset@Import[JSONfilename, "RawJSON"]);
xlsData["MetaData"]

Below are the excel file based stops after we have removed entries that have no coordinates and stop ID (global_Id). One can get more info about global_id here: link
xlsStops = xlsData["StopData"]
We have discarded all those entries that have no latitude or longitude. Also duplicate entries are removed. Every stop has a entry for "valid till". We are considering only those stops that have validity after current time (Now).

Below are the stops. There are total 12930 entries.

Here are available column names in the excel files. We highlight the ones that we consider for our analysis.
{Gültig von,Gültig bis,Letzte Änderung,Ort,Landkreis,GKZ,Alias Ort,Alias GKZ,Globale ID,Zentrale Haltestelle,Ansagetext,Kommentar,Name mit Ort,Nummer,Name ohne Ort,Kurzbezeichnung,Buchsatzkurzname,Anschlussdarstellung,EFA-Zusatzname,Name Aushangfahrplan,Name ZOB-Aushang,Name Betriebszweig 90,Name Fahrscheindrucker,DB Kennung,Attribut "Inneranzeiger 16 Zeichen",Attribut "Inneranzeiger 20 Zeichen",Attribut "Inneranzeiger 24 Zeichen",Bereiche,Bereiche Langbezeichnung,Steige (Bereich),Koordinaten,Tarifzonen,Tarifnummer,Zielcode,Zugeordnete Haltestellen,Unternehmer,GIS-Verkehrsmittel,Georeferenzen (auf Steigebene),Grafische Informationen,Externe Schlüssel,Quellkoordinaten-X,Quellkoordinaten-Y,Zielkoordinaten-X,Zielkoordinaten-Y,Bedienende Linien}
Data Anomaly A
Stops with duplicate/empty global_id
There are 50 stops with duplicate global_Id in this list. This means they have either different names or locations but same global_Id.

How to resolve Anomaly A
These stops should be verified with VRN authorities. We have attached the CSV (anomalyA.csv) file of the above table consisting of fifty stops.
Data Anomaly B
Stops with no Tariff/Wabe info
There are 1641 stops in excel data without wabe number/tariff zone reference.
(* Data without Anomaly A*)
withoutAnomaly1 = Complement[xlsStops, anomaly1];
Length@withoutAnomaly1[Select[#"tariff_zones" === "" &]]
1641
There are 692 stops in excel without any operation (bus/tram/train line) reference. Does this imply these stops are currently inactive and no public transport vehicle service them?
withoutAnomaly1[Select[#"operator_lines" === "" &]] // Length
692
There are 2164 stops in excel data that are without either any operation (bus/tram/train line) reference or wabe number/tariff zone reference.
withoutAnomaly1[Select[(#"operator_lines" === "" || #"tariff_zones" === "") &]] // Length
2164
There are 10714 stops in excel that has both necessary attributes: operation bus/tram/train line and wabe/tariff zone info.
withoutAnomaly2 =
withoutAnomaly1[
Select[! (#"operator_lines" === "" || #"tariff_zones" === "") &]][
All, {"gkz_id" ->
Function[{str}, StringTake[#1, 5] & @@ StringSplit[str, "("]]}][All, {"tariff_zones" ->
Function[val,
ToExpression@
If[StringContainsQ[val, ","], StringSplit[val, ","], {val}]]}]
This are the GKZ ids that can be used to geo-fence whole VRN region if we consider the above 10714 stops.
safe = withoutAnomaly2[[All, "gkz_id"]] // Normal // Union // Sort
{06414,06431,06432,06433,06436,06437,06439,07133,07134,07311,07312,07313,07314,07315,07316,07317,07318,07319,07320,07331,07332,07333,07334,07335,07336,07337,07338,07339,07340,08125,08126,08128,08212,08215,08221,08222,08225,08226,09663,09676,09679,10045,24067}
However if we did not remove the 2164 stops then we could have 16 more GKZ ids in VRN.
all = withoutAnomaly1[
All, {"gkz_id" ->
Function[{str},
StringTake[#1, 5] & @@ StringSplit[str, "("]]}][[All,
"gkz_id"]] // Normal // Union // Sort;
Complement[all, safe]
{06411,06412,06438,07111,07140,07231,07235,08111,08121,08127,09571,09661,09671,09677,10041,10046}
We need to decide if the above GKZ are truly out of VRN region or not. Once we have a correct list of GKZ ids for VRN we can use it to geo-fence the stops in VRN region.
Now it is simple to choose the stops that are in Mannheim (the wabe to consider: {74, 84, 94, 104}).
withoutAnomaly3[Select[ContainsAll[{74, 84, 94, 104}, #"tariff_zones"] &]]
We have total 485 stops qualifying for the given tariff zones/wabe by rnv.
{74, 84, 94, 104}
How to resolve Anomaly B
This stops should be verified with VRN authorities. We have attached the CSV (anomalyB.csv) file of the above table consisting of 2164 stops. To solve Anomaly B two things needs to be assured.
- All stops that are listed in Excel file must have Tariff zone (Excel column:
Tarifzonen) information. - Those stops which have no connection (Excel column:
Bedienende Linien) in Excel file must be checked if they are really part of active stops in VRN.
General data mismatch in Excel and TRIAS
In this analysis we have taken into consideration several available data sources. We can not find a common ground truth as all data sources disagree with each other in some cases. We will now validate the Excel data against the data available in TRIAS.
Geo-spatial clustering of VRN stops to sync with TRIAS
We would like to test if TRIAS has reference/globalID (base on LocationInformationRequest method) for all the stops that are listed in the Excel data. The duplicate free data is taken as a base to test against TRIAS. This data does not have the Anomaly A stops. Now we will spatially cluster the stops into 2500 groups based on geo-distance. We will also find the bounding box for each of those clusters.
result = FindClusters[clustering, 2500, DistanceFunction -> distanceM,
Method -> {"Agglomerate", "Linkage" -> "Complete",
"SignificanceTest" -> {"Gap", "Tolerance" -> 3}}]

Open GIS data issue: We can see many stops specially the ones in the north west of VRN falls outside the GIS VRN boundary. This problem makes us suspect that open GIS data about the VRN boundary is not fully correct. Different colors represents different clusters of stops in the above figure.
Now lets look into one such cluster from the 2500 we formed above. We chose a cluster of stops near Bingen. The red dots forms the cluster of stops from Excel data. The two blue points define a bounding box for the cluster of stops. We have given an extra padding of half a kilometer around the true geometric bounding box (orange color box). The yellow colored disk/circle covers the bounding box and it's radius and center's geo-location will be used to make the query (based on LocationInformationRequest method) against TRIAS .

Excel stops that were not found in TRIAS
The number of stops that exists in Excel data but could not be found in TRIAS after checking 2500 clusters is 1371. This implies that the Excel data has several stops with globalID that could not be found in TRIAS.
notFoundStops = Last@resultTRIAS // Flatten // DeleteDuplicates;
notFoundStops // Length
1371
Many of the Excel stops that could not be found in TRIAS have very similar location and name but different stop reference (globalID) in TRIAS. However if one use the Excel data based globalID to find journeys from TRIAS they will fail as those globalID's have changed (or do not exist) in TRIAS.
notFoundStopsDatasetCompare = Dataset[
Append[
#[[{"name_with_city", "lat", "lon", "global_id"}]]
,
nearest = (First@link[({"lat", "lon"} /. #)]);
Association@Join@
{{"trias_id" -> "stopID" /. nearest,
"trias_name_with_city" -> ( ("stopName" /. nearest) <>
"," <> ("locName" /. nearest))}
,
{"nearest_dsitance" ->
Quantity[
1000 distanceM[({"lat", "lon"} /. #),
ToExpression@({"lat", "lon"} /. nearest)], "Meters"]}
}
] & /@ (notFoundStopsDataset // Normal)
][SortBy[#"nearest_dsitance" &]]

The description of the column names:
name_with_cityname of the stop including locality. (this stop could not be found in TRIAS)- (
lat,lon) geo-location from the excel file. global_idis the ID or stop reference that one uses to call TRIAS API for journey planning and to query trips from the stop. (this globalID could not be found in TRIAS)trias_idis the globalID of a stop nearest to the excel stop available in TRIAStrias_name_with_cityis the name of the nearest TRIAS stop with locality name.nearest_dsitanceis the distance of a stop from excel data to the nearest stop available in TRIAS.
We can see the nearness distribution (distance from nearest TRIAS stop) of the stops that could not be found in TRIAS below. This makes us suspect that many Excel data based stops exist in TRIAS but with different globalID. There are around 575 such stops, see the figure below.

We have attached the CSV (anomaly-TRIAS.csv) file of the above table with 1371 stops.
Geo-fencing the boundary map of VRN
A GIS can recognize and analyze the spatial relationships that exist within digitally stored spatial data (i.e. VRN stops data). These topological relationships allow complex spatial modeling and analysis to be performed. Topological relationships between geometric entities traditionally include adjacency (what adjoins what), containment (what encloses what), and proximity (how close something is to something else). If we have a boundary map of VRN constructed it will be very easy to implement a strict service area largely independent of data changes in TRIAS. Of course if new stops are added in TRIAS that falls outside the constructed boundary map then we will need to update the boundary map again.
Open GIS data based boundary map
We will take GIS date from Internet that represents the boundary of VRN region. We have extracted the shape file of the VRN region from the whole GIS data of Germany.
- The data source: link
Currently these 2377 geo-locations/points will define the boundary of VRN for our analysis.

Let us visualize the boundary of VRN based on the above GIS data.
GeoGraphics[{EdgeForm@Directive[Blue, Thickness[.0015]], Blue,
Polygon[GeoPosition /@ pts],GeoMarker[Here, "Scale" -> Offset[20]]}, GeoZoomLevel -> 9]

Stops situated outside of the VRN boundary (by 2.5 km)
Let us only get the stops that fall outside the VRN boundary by 2.5 kilometers and their respective distances from the boundary. We can see there are 3958 of such stops. The stops are stored in the file supplied with this report (anomaly2.csv) .
xlsStopsOutsideOfVRN =
xlsStops[Select[! memberVRN[ToExpression@{#lat, #lon}] &]][
All, <|#,
"distance_from_boundary" ->
Quantity[
distanceM[ToExpression@{#lat, #lon},
First@distanceVRN[ToExpression@{#lat, #lon}]],
"Kilometers"]|> &][SortBy[-#"distance_from_boundary" &]];
xlsStopsOutsideOfVRN[Select[QuantityMagnitude@#"distance_from_boundary" > 2.5 &]]

Stops situated outside of the VRN boundary (by pure mathematics)
The number of stops situated mathematically outside of the VRN boundary are 4519 in total. We considered the open GIS data based boundary map. The stops are stored in the file supplied with this report (anomaly3.csv) .
Length@xlsStopsOutsideOfVRN
4519
Export stops outside of VRN
These large number of stops should be verified with VRN authorities. We have attached the CSV (anomaly2.csv / anomaly3.csv) file of the above lists. However given the large number of stops falling outside of the GIS boundary, we suspect that the GIS boundary map is not correct.
Next we will try to make our own boundary map based on the stop data from Excel. We will need some computational geometry to do that.
VRN boundary map with computational geometry
In computational geometry, an alpha shape, or α-shape, is a family of piecewise linear simple curves in the Euclidean plane associated with the shape of a finite set of points. They were first defined by Edelsbrunner, Kirkpatrick & Seidel (1983). The alpha-shape associated with a set of points is a generalization of the concept of the convex hull, i.e. every convex hull is an alpha-shape but not every alpha shape is a convex hull.
We will use this methods to construct a map of VRN region. However for any computational geometry activity one needs to fix the concept of distance/metric.
Haversine Distance
For any two points on a sphere, the haversine of the central angle between them is given by
where
is the haversine function:
- d is the distance between the two points (along a great circle of the sphere; see spherical distance),
- r is the radius of the sphere,
- φ1, φ2: latitude of point 1 and latitude of point 2, in radians
- λ1, λ2: longitude of point 1 and longitude of point 2, in radians
On the left side of the equals sign d/r is the central angle, assuming angles are measured in radians (note that φ and λ; can be converted from radians to degrees by multiplying by 180/π ).
Solve for d by applying the inverse haversine (if available) or by using the arcsine (inverse sine) function:
where h is , or more explicitly:
We implement this distance calculation method to approximate distance between two points (Geo-location) given their coordinates in WGS84 GPS datum.
Alpha shape formation through iterative process
We use an iterative process to remove whole from the possible map shapes and arrive at a practical boundary map of the VRN region.
circumRadius2D =
Compile[{{v, _Real, 2}},
With[{a = Norm[v[[1]] - v[[2]]], b = Norm[v[[1]] - v[[3]]],
c = Norm[v[[2]] - v[[3]]]}, (a b c)/
Sqrt[(a + b + c) (b + c - a) (c + a - b) (a + b - c)]],
RuntimeOptions -> "Speed", RuntimeAttributes -> {Listable},
Parallelization -> True];
One can see the results of the iteration process of our algorithm. Form left to right new boundary maps are generated as the iteration goes on. The red boundary is constructed using open GIS data as we have done previously. This boundary is kept for comparison. The black points are the stops GPS location from the Excel/TRIAS data. At the rightmost image the finally constructed boundary of VRN is shown. This constructed boundary is defined by much less points than the apparently wrong GIS data based boundary (red). This boundary is easy to maintain and can be used for advanced geo fencing applications.

We can also see the alpha shape based boundary on a real map for better understanding.

Note: The thick red boundary is created by our above alpha shape based construction given the data points obtained from TRIAS/Excel. The orange boundary is based on open GIS data which is apparently wrong. The red dots are the stops available from TRIAS/Excel. The Blue cluster are stops in tariff zone:{74, 84, 94, 104}and the green colored cluster are stops in Kaiserslautern (zone:800) and the cyan colored cluster represents the stops in Pirmasens ( zone:700).
We can see that the correctness of the alpha shape will depend on the correctness of the data that is in Excel file provided by VRN. If this data is correct the corresponding alpha shape based boundary map that we constructed will be accurate.
Stop acquisition from TRIAS
Assuming that we have access to a geographical map of VRN region we can now use the boundary map data to fetch all the VRN stops from TRIAS. In this section we show and describe the process of fetching stops from TRIAS and some related results.
The process of fetching stops from TRIAS has four main parts.
- Topological operation on the boundary map and discretize (triangulation) the geo-region into a two dimensional mesh which is just a collection of imaginary triangles.
- Form a circumcircle for every triangle in the mesh and use the geometric data to make a recursive query to TRIAS.
- Aggregate the stops obtained from TRIAS via uniqueness testing of the globalID
- Finally perform a geometric containment test of each aggregated stops using the boundary map data.
Triangulation and TRIAS query
Based on the open GIS map the triangulation of VRN region will look like the following. There is around 4900 triangles that discretize the whole region. Main reason for this triangulation is it makes our process of fetching data from TRIAS very fast and robust.
Triangulation
In the following we show the result of the triangulation process. It is a very high quality discretization given the scale of the geo-data. The histogram shows the size of the triangles.

Note: The figure is based on the open GIS based boundary map of VRN. We can use the alpha shape based map for the same purpose. If we use the alpha shape based map then we can use the Excel data set. However for that the excel data needs to be free of Anomaly A and B.
Recursive TRIAS Query
We make a query to TRIAS that adapts the radius based on results obtained. If there is no result it recursively increases the query radius.
triasCall[test_?AssociationQ] :=
Block[{tick, tock, vRequestTimestamp, lat, lon, n, looper, reslooped},
vRequestTimestamp =
DateString[DateObject[Now, TimeZone -> "Europe/Berlin"],
"ISODateTime"];
tick = AbsoluteTime[];
{lat, lon} = First@test["center"];
looper[rad_] := Block[{res, json},
json =
FindClosestStops[vRequestTimestamp, lon, lat, "RADIUS" -> rad,
"Pretty" -> False];
res =
Append[ImportString[
If[json === Null, ExportString[{"numberOfStops" -> 0}, "JSON"],
json], "RawJSON"], "RADIUS" -> rad];
res
];
n = 0;
While[
reslooped = looper[(test["rad"] + 50 n)];
reslooped["numberOfStops"] == 0,
If[reslooped["numberOfStops"] >= 1, Break[]];
n++];
tock = AbsoluteTime[];
Association@{"position" -> test["position"], "trias" -> reslooped,
"compute_time" -> (tock - tick)}
]
In the following we show example of a TRIAS query result around a triangle. We also show the radius distribution of all queries made to the TRIAS API based on locations in VRN region. With this queries we can get all stops and their up to date globalID. These globalID is needed for finding trips from those stops and for journey planning.

Another way to look at all the TRIAS queries and their radius on a map. Red representing bigger radius and green smaller ones. Multiple queries are grouped to form this density histogram.

From the following distribution one can observe that most queries were done to TRIAS with a radius less than a kilometer. This process can give some redundant stops and also stops outside of VRN region but in the next aggregation stage we can filter them out and keep only the unique valid stops.

Stop aggregation and map containment testing
In this stage we aggregate the stops based on their unique globalID and perform a geometric containment test based on the boundary map of VRN. In case we have all the GKZ ids of VRN we can also use that data for the containment testing of aggregated stops.

Note: The number of stops are much less than Excel as we have used the open GIS data based boundary map of VRN thus leaving many stops out due to containment test failure.
Conclusion and process guidelines
Process guideline to select stops in VRN
Once Anomaly A and B are solved we can technically use the VRN stop data for Offline-ticket and Model city Mannheim.
- If rnv can select and provide the GKZ numbers of areas coming under the VRN region, we can use that data to filter the stops in Excel. This will confirm no stops are considered in excel that are outside of VRN region. Then we can use the filtered Excel data for checking in and checking out for eTarif app. In this way we will not allow users to check in or check out outside of VRN region.
- We will also need to have tariff zone information for all the stops in Excel that currently have no tariff information available. This is a requirement to solve Anomaly B. Once all stops in Excel have tariff zone information we can filter them specifically for Model City Mannheim project and give a special tariff on stops falling under the chosen tariff zone.
Here is the German GKZ data repository: link
In the following we show the process flow to collect and validate stops in VRN region by solving the Anomaly A and B. This is done in collaboration between rnv and sevenre. This process can be repeated whenever there is change in stop data from VRN.

Other observations
About the other data related problems we have the following observations. One should look into this topics to further improve the acquisition of stops or to keep the stop list in sync with TRIAS.
- We should also check if the far off stops like Koblenz, Stuttgart etc. should be also removed from the Excel list or not.
- To solve the issue with stops that has globalID which could not be found in TRIAS one can ask VRN why these stops have conflicting globalID compared to TRIAS.
- If a correct boundary map of VRN region is available or constructed (like we did using alpha shapes) from a correct excel data this will help us also to recursively query TRIAS and get all the stops. However getting the tariff zone for each stops will be hard but we can find some way out for that assuming TRIAS has tariff information for all stops with complete accuracy.
- We have analyzed the VDV-452 open data of VRN and found similar mistakes as in the excel data. There are several stops without tariff zone info and several stops (total 298) without GPS locations. However all stops in VDV-452 data of VRN has tariff zone/ wabe information but no globalID.
- We can use corrected excel data or the VDV-452 data of VRN to define the boundary map of VRN and use our recursive process to fetch stops from TRIAS. We can then use the fetched stops from TRIAS as our final collection of stops. This stops will have TRIAS compatible globalID so trips can be shown and journey planning can be done.
© 2022 SevenRE UG (haftungsbeschränkt).
All rights reserved. No graphics, code and results of this report may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of SevenRE.
