Sunday, February 7, 2016

CourtHive

Fan Participation Platform

CourtHive is a web application for charting tennis matches. It was inspired by the Match Charting Project and is intended to support the growth of an open-source repository of matches which will aid statisticians in advancing the state of the art with respect to tennis.

Monday, January 25, 2016

Pondering Point Patterns


Tennis players are actors in a social network of fellow competitors, and every exchange of shots in a rally can be seen as a conversation that is part of a narrative encompassing each new achievement in the sport. Do some players try to have the same conversation with every opponent they face? Do some have a more expansive vocabulary and select specific patterns of play every time they face specific opponents? Are some players able to change the conversation when it’s not going their way? Are there ‘truthy’ stories that sports journalists and experts spin to keep the banter lively and the audience engaged? Can statistical analysis reveal whether it makes any sense to be asking these questions?

Read the full article... published on The Tennis Notebook.

Saturday, January 23, 2016

The Refactor Factor - Pursuing Patterns

Refactoring

I've spent the past few months refactoring the TennisVisuals.com codebase to support a range of new features and data sources that will surface in 2016.  It may not appear that a great deal is happening, and I certainly haven't produced what I thought I would produce in the timelines I envisioned, but from a cold start less than a year ago I'm finally producing code that I'm willing to share with the public on GitHub.
The "Match Radar" had a significant overhaul and was my first use of a "reusable" and "updating" approach to D3 chart design.  The release of the source for the "reusable, updating radar chart" was picked up by "Building Widgets" and resulted in an R derivative of the chart, "d3radarR".

More significant for the future of TennisVisuals.com, however, was the redevelopment of the Points-to-Set Chart, which resulted in the creation of a "Universal Match Object" (UMO).  The UMO "understands" the structure of tennis matches and serves as a "validating" container for point data from which numerous "views" of a match may be generated. Points-to-Set is one such "view".
I recently used the UMO to validate 96,000+ matches with point-by-point data which were provided by Jeff Sackmann. (You can read about the results of that analysis here).  As I refactor my codebase the UMO has become central to the integration of disparate data sources.
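
The idea of a "validating" container can be illustrated with a minimal sketch. This is not the UMO itself: the class name `GameSketch`, its methods, and the 0/1 encoding of players are all hypothetical, and only a single standard game (no tiebreaks, sets or matches) is modeled. The point is simply that the container knows the scoring rules and refuses point data that cannot belong to a valid game.

```python
from dataclasses import dataclass, field


@dataclass
class GameSketch:
    """Hypothetical validating container for the points of one standard game."""

    # Each entry is 0 or 1: the player who won that point.
    points: list = field(default_factory=list)

    def score(self):
        """Return the number of points won so far by player 0 and player 1."""
        return self.points.count(0), self.points.count(1)

    def winner(self):
        """Return 0 or 1 once a player has at least 4 points and leads by 2, else None."""
        a, b = self.score()
        if max(a, b) >= 4 and abs(a - b) >= 2:
            return 0 if a > b else 1
        return None

    def add_point(self, player):
        """Append a point, rejecting input that cannot occur in a valid game."""
        if player not in (0, 1):
            raise ValueError(f"unknown player: {player!r}")
        if self.winner() is not None:
            raise ValueError("game is already complete")
        self.points.append(player)


game = GameSketch()
for p in [0, 0, 1, 0, 0]:
    game.add_point(p)
print(game.score(), game.winner())  # (4, 1) 0
```

Any "view" of the game (a score line, a point progression) can then be derived from the validated list of points rather than from the raw input.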

A great deal of my focus of late has been re-visiting the data which Jeff Sackmann and his team of volunteers have been generating over at the Match Charting Project.  The number of charted matches has been growing by leaps and bounds.  When I started TennisVisuals.com I believed I would be far more focused on data that can be captured using applications such as Pro Tracker Tennis, but once I expanded the project to support multiple data sources the draw of the larger (and growing) MCP dataset began to exert more influence, especially as there is a broader audience for visualizations of professional tennis matches, and a greater opportunity for conversations and collaborations with others who are interested in analyzing the data.

Patterns of Play

Last week Nikita Taparia and Jeff Sackmann launched the Tennis Data Storytelling Challenge.  Since one of the goals of TennisVisuals.com is to showcase visualizations of tennis data, I elected to be one of the sponsors of the challenge.  I will showcase the top three visualizations that result from the challenge and endeavor to present them in such a way that they will be able to utilize the growing database of matches in real time.

To help promote the challenge I wrote a piece for The Tennis Notebook: "Pondering Point Patterns", and to further facilitate the exploration of MCP data I've been busy creating some MCP-specific tools.  The UMO has been central to this effort.  At this point .CSV files generated by the MCP spreadsheet are being validated by the UMO, which is then providing a standard API for querying and manipulating match data.

One of the most significant features of the Match Charting Project is that it is possible to capture not only point progressions and rally lengths, but also some details about specific strokes and shot placements during rallies.  Working with MCP data is driving the development of the UMO to a point where it will not only "understand" the structure of matches, but also the layout of the court.

Over at HeavyTopspin.com Jeff has written a number of articles based on analysis of shot-specific data. One of the first articles that caught my eye (there are some color charts) examines the effectiveness of "Return-of-Serve" placement; another article looks at the tactical importance of "finding" an opponent's backhand.  On the whole, however, there haven't been many explorations of this rich portion of the MCP data.  At TennisVisuals.com I've not even touched it, until now...


Tuesday, November 17, 2015

Rally Tree: Point Distribution and Win Percentage

Tennis is an "intermittent" sport.  The level of intensity can vary greatly with the rally length of points and the time taken between points (among other factors including surface, ball type, sex and level of play).  When rallies are visualized they are typically depicted temporally from the first point to the last, which gives a jagged chart where it is difficult to discern any pattern at all.  "Rally Tree" is an attempt to bring a different perspective to the analysis of rallies.

 "Rally Tree" depicts the distribution of points across various rally lengths, beginning at the top with rally lengths of Zero, which indicate either Aces, Serve Winners, or Double Faults. Color coding differentiates errors where balls were "netted" vs. hit long.

 
There are several available views. The default view displays all points for a single match or selection of matches. You can filter by player to display only points served by either player (or composite of opponents).  Notice the number of winners among the points won for servers vs. those receiving.

Additionally there is an overlay depicting the percentage chance that a point was won for any given rally length. The offset vertical lines represent 50% either side of center (0%).
For the "served points" views, this gives a graphic representation of the Persistence of Server Advantage, which varies greatly among players. Please note that this is not the same as percentage of points won for a given rally length.
The "Rally Tree" graphics in this post are of Novak Djokovic's matches at Wimbledon in 2014 and 2015. The last two images depict the persistence of Djokovic's server advantage on the left and a composite of his opponents' server advantage on the right. Djokovic's dominance is obvious.  Apart from rallies of seven, he had a greater than 50% chance of winning all points with rallies up to sixteen.  His opponents' composite server advantage only extended to rallies of five.
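
The distribution that underlies the tree can be sketched in a few lines. The input layout here, a list of `(rally_length, server_won)` pairs, and the function name `rally_tree_rows` are assumptions for illustration, and the sample points are made up. The sketch only counts points per rally length, split by who won them; it does not attempt the win-percentage overlay, which, as noted above, is a different calculation.

```python
from collections import Counter


def rally_tree_rows(points):
    """Count points per rally length, split by who won the point.

    `points` is an iterable of (rally_length, server_won) pairs.
    Returns a list of (rally_length, won_by_server, won_by_receiver).
    """
    server = Counter()
    receiver = Counter()
    for rally_length, server_won in points:
        (server if server_won else receiver)[rally_length] += 1
    lengths = sorted(set(server) | set(receiver))
    return [(n, server[n], receiver[n]) for n in lengths]


# Made-up sample: rally length 0 covers aces, serve winners and double faults.
sample = [(0, True), (0, False), (1, True), (3, True), (3, False), (3, True), (4, False)]
for length, won_by_server, won_by_receiver in rally_tree_rows(sample):
    print(f"{length:>2}  server {'#' * won_by_server:<3} receiver {'#' * won_by_receiver}")
```

Each returned row corresponds to one level of the tree, with the two counts drawn on either side of the center line.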

You can play around with a live version of Rally Tree and explore your favorite players at TennisVisuals.com

In the near future "Rally Tree" will be integrated with "Game Tree" and other TAVA components so that selections in one component can drive views in another.  For instance, a "Point Progression" from 0-0 to 0-15 can be selected in "Game Tree" to view the distribution of points in the "Rally Tree" or "Points-to-Set".  From this point it will be possible to explore whether there are certain points in a match when rally lengths increase...

To read more about "Persistence of Server Advantage" please follow the link to Jeff Sackmann's blog post on the topic.

Monday, October 26, 2015

State of the Art of the Stats

At the end of August, in a post discussing the exploration and filtering of shots from tennis matches, I referenced the fact that the majority of data generated by tennis matches hasn't been captured; venues where Hawk Eye is available are the exception, and even Hawk Eye doesn't capture every detail that could be useful in downstream analysis of matches.

Certainly today more and more tennis data *is* being captured, and the alternatives for capturing data are expanding steadily - when there is video available it is even possible to capture additional data from historical matches (indeed this is being done to some extent by volunteers for the Match Charting Project, and automated video processing tools are in development which could facilitate this process) - but if "progress" is to be made in understanding how the analysis of tennis data can contribute to the development of the game (which primarily means player development), at least two things need to happen:
  1. tennis data has to be made accessible, and 
  2. tennis data has to be transformed into a standard format
Obvious, right? 

As I've pursued my vision of a Tennis Analytics Integration Platform I've struggled with both of these issues at almost every turn.  My professional career was entirely focused on inter-application, cross-enterprise and intra-enterprise data and process integration, so it is very familiar territory, and unsurprising.  But I *was* surprised when I initially began to track my boys' tennis matches and found that the elegant iPhone/iPad application ProTracker Tennis allows for export of match data in an easily parseable format.  And I was inspired, as many others have been, by the entirely open approach to tennis data and statistics taken by Jeff Sackmann, who makes his entire dataset available on GitHub.  (Here is a great Guardian article about Sackmann's effort).

Still, the integration of data from only two sources requires careful consideration.  As an example, ProTracker Tennis captures a good amount of information, but only a handful of shots out of every point.  The Match Charting Project makes it possible to capture *many* attributes for *every* shot of every point, but doesn't attempt capture of shot coordinates.  The former tracks "Forcing Errors" (but not the error that was forced) while the latter charts "Forced Errors". The "upshot" of such differences is that data must not only be coerced into a "standard" view, but also that it is often not possible to generate the same set of statistics (or visualizations) for matches from each source.  
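
As a rough sketch of what coercing into a "standard" view can involve, the function below maps the two error conventions onto one shared record. The field names, the source labels and the 0/1 player encoding are hypothetical, and real exports from either tool are structured differently. It also assumes that a "Forcing Error" is credited to the player who forced the error, while a "Forced Error" is credited to the player who made it.

```python
def normalize_point(source, player, result):
    """Map a source-specific point result onto one shared record.

    `source` is "protracker" or "mcp", `player` is 0 or 1 (the player the
    source credits with the result), and `result` is the source's label.
    All of these names are illustrative assumptions.
    """
    opponent = 1 - player
    if source == "protracker" and result == "Forcing Error":
        # Credited to the player who forced the error, so that player won the point.
        return {"winner": player, "result": "forced_error", "error_by": opponent}
    if source == "mcp" and result == "Forced Error":
        # Credited to the player who made the error, so the opponent won the point.
        return {"winner": opponent, "result": "forced_error", "error_by": player}
    raise ValueError(f"no mapping for {source!r} / {result!r}")


a = normalize_point("protracker", 0, "Forcing Error")
b = normalize_point("mcp", 1, "Forced Error")
print(a == b)  # True
```

Raising an error for anything without a mapping keeps the gaps between sources visible instead of silently dropping statistics that only one source can supply.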

Admittedly, ProTracker Tennis and the Match Charting Project are, in general, targeting different audiences: ProTracker Tennis is mostly used to track amateur and junior matches while the Match Charting Project is focused on ATP and WTA matches.  Nevertheless, it is possible to use either tool to track any match; and it is further possible that comparing match play of amateurs, juniors and professionals in a common framework will produce useful insights for player development.

Disagreeable Data

This week I (finally) began adding additional data sources for ATP / WTA matches to the Tennis Analytics Integration Platform.  There are now more than 2,500 matches which can be visualized in TAVA, and that number is poised to grow rapidly.  Not only will the number of matches grow; the number of matches for which there are multiple "views" will also grow.  

In an ideal situation, two views of a single match would be entirely complementary, overlapping only for player names, tournament and venue names, and dates.  This does occur when point-progression data is pulled from betting sites (which tend to have score results and match odds but not traditional stats) and pre-calculated statistics are pulled from a tournament site (which don't ever seem to have point-progression data).  

But not all situations are ideal.

Merging data sources is always tricky, even when the data originates from the same domain and ostensibly covers the same conceptual territory. In addition to the challenges presented by the fact that various data sources have modified and extended their own data formats over the years, there is the "not insignificant" issue that a number of tennis statistics are generated from data that is very much subject to the interpretation of the individual doing the gathering.  (see articles by Carl Bialik and Jeff Sackmann on the topic of forced vs. unforced errors).
“I think if you have two or three different people recording unforced errors, you’re going to get two or three different figures,” said Kevin Fischer, senior communications manager for the Women’s Tennis Association.  - NYT 
The matches from data sources I've added this week overlap significantly with the Match Charting Project.  This is both a headache (from a design and programming point-of-view) and an opportunity. One of the potential weaknesses of the Match Charting Project (eloquently articulated by Stephanie Kovalchik in her blog On-the-T.com) is that match data is gathered by a single person working, often in real time, with a spreadsheet.  Errors can be introduced into the data that are hard to weed out. With a second and even third "view" of point-progression and final statistics, discrepancies can be automatically identified and flagged.  [An obvious addition to Tennis AiP is a Match Editor in which such discrepancies can be resolved - and indeed this is in the roadmap.]
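
A minimal sketch of how discrepancies between two "views" of the same point progression might be flagged automatically is shown below. The representation of a view as a flat sequence of point winners (0 or 1) and the name `flag_discrepancies` are assumptions for illustration, and the sample data is made up.

```python
from itertools import zip_longest


def flag_discrepancies(view_a, view_b):
    """Return (point_index, a, b) for every point where two views disagree.

    Each view is a sequence of point winners (0 or 1); None marks a point
    that is missing from the shorter view.
    """
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(view_a, view_b))
        if a != b
    ]


charted = [0, 0, 1, 0, 1, 1]      # e.g. a volunteer's spreadsheet
betting = [0, 0, 1, 1, 1, 1, 0]   # e.g. a point-progression feed
print(flag_discrepancies(charted, betting))  # [(3, 0, 1), (6, None, 0)]
```

A positional comparison like this is deliberately naive: a single missing point shifts everything after it, so a real comparison would probably need to align the two views by game and set before comparing points.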

It's important to remember that even the "official" data for professional tennis matches is still being generated by teams of humans working "behind the scenes".  A 2014 article which appeared in The Guardian gives some perspective on the "State of the Art" of how tennis statistics are still gathered today...  A team of 48 "data entry people" is scattered about Wimbledon; most sit court side.  They have technology, yes, but, apart from serve speed, there is a human generating every bit of data. And not every court can be covered to the extent that full stats can be generated. This is apparent even from a review of stats available on IBM's Slamtracker - matches between lower ranked players which take place on peripheral courts often have little to no data available.

All of this leads to the thought that there are further opportunities for crowdsourcing tennis data and generating statistics and match views that could push beyond what is currently being done for the professional tours, even by top-flight corporations.

Machine Dreams and Silo Silliness

But what about Hawk Eye, you may ask?  Yes, impressive technology, generating huge amounts of data, all of it inaccessible to statisticians and aficionados.  It's not even clear that the coaches of top tennis professionals actually know what to do with it all, yet.  Damien Saunder at GameSetMap.com has a number of articles looking at what kind of analysis Hawk Eye data makes possible.  At present, however, none of the official statistics available online appear to have been actually produced by systems such as Hawk Eye; it is used for adjudication and for the generation of graphics during televised performances, to "enhance viewer experience".  Certainly Hawk Eye can be viewed as "State of the Art" technology for raw data capture, but it is not clear that it represents the "State of the Art" for charting and statistical analysis.

It is also not clear to what extent a tennis match *can* be automatically charted, though companies such as Mojjo and PlaySight are now gathering impressive amounts of data and making it possible to deliver useful and usable systems which can be operated by club players.  Both companies are reducing the cost to end-users and, while not yet entirely portable, have made it possible to at least increase court coverage - the number of courts from which some portion of match data could potentially be accessed.  There are also companies such as Tennis Analytics (high end) and Tennis-Stat.com that offer services for post-match video processing from end-user video, but none appear to be as advanced as what Damien Saunder achieved using ArcGIS in this 2012/2013 effort (though Tennis Analytics is enabling broad and powerful exploration and filtering of match data - see the end of this article).

The big question, from my perspective, is whether these relatively new entries into Match Tracking (or Charting) will remain Data/Information Silos and be as inaccessible to third parties as Hawk Eye. If it is not possible to integrate data from such sources then we are left in a situation where the "State of the Art" could only be advanced within the context of a single application environment (a single corporation), and, barring a scenario where one technical "solution" becomes ubiquitous across all venues where one might potentially play tennis, we could never achieve a Big Data view, even for a single player (though a single player could potentially spend hundreds of thousands of dollars to pay a single service to generate their own data set).

I also see the "Silo Mentality" in practically every tennis application that is available for Tablets and Smart Phones (apart from the sad fact that the vast majority of them are inscrutable and/or useless).   The primary focus of every new tennis tracking application or product offering (including racquet sensors, the new Babolat POP and Pulse Play) is to "build a community" and to cash in on "social marketing"; only a very few such products that I've reviewed (out of several dozen) have any sort of intentional data export capability, and fewer still export any data that is useful for pushing the "State of the Art" in tennis analysis forward or in fact adding meaningfully to what little conversation there is about how any of the data or analysis will contribute to better tennis (to player development).

The most potent and relevant counter example to the situation I've described above is probably the Developer API offered by FitBit, a leader in the "Activity Tracker" market.  There is an increasing number of such gadgets that are allowing third parties, usually for a monthly subscription fee, to access and integrate data in real time.  Similar APIs are catalogued by ProgrammableWeb.  There is no reason why vendors of products and services related to tennis match data could not create similar offerings.  In the future, the Tennis Analytics Integration Platform will expose just such an API, and I have created a number of examples in the hope of inspiring collaborators with more data visualization experience than myself to join the effort.

Conclusion

There seems to be a general agreement that presenting more data is a good thing, and that surely the presentation and even visualization of data will be helpful, but I fear there is a real danger that it is all just a distraction and that the only ones who *may* benefit, for some period of time, are those selling the technology.  It certainly seems that IBM's and SAP's pursuit of Analytics for the ATP and WTA, respectively, is far more about marketing their brands and keeping the audience entertained than it is about actually providing meaningful insights. (Not that entertainment is a bad thing!) This is a critique that appears often, most recently in Nikita Taparia's very entertaining Tennis Note #24.

Ultimately the benefits to be found from analyzing data across a large number of tennis matches will be limited by the subset of common statistics and "views" that can be derived from whatever number of data sources are amenable to integration.  The goal of integration is to maximize the amount of data which can be usefully processed while minimizing the degree to which differences in the structure and "views" offered by each data source impact the scope of viable analysis.

I believe it would ultimately be to the benefit of all who are producing tennis-related applications as well as those who are working in the field of Sports Analytics if there were an Open Data approach to data generated from tennis matches. There have been, without a doubt, many passionate pleas for such an approach that preceded this screed. If you've made it this far, I'm surprised again!

Sunday, October 4, 2015

Game Tree: Point Progression


"Game Tree" is a depiction of Point Progression for a selection of games within a tennis match or across a series of tennis matches; it is a Sankey Diagram and possesses the "Markov property", meaning that the set of future "states" that are possible are constrained by the current "state", the point score at any moment in a game. Here is a nice interactive explanation.
"Markov Chains" have often been applied to tennis games.  Google "Markov Tennis" and you'll find a large number of articles on statistics which use Tennis to explore probability.  A few of the results use Data Flow Diagrams to depict Point Progression: Wolfram Alpha has an attractive visualization (see above) which was reproduced in an article on Predictive Modeling; and NC State University produced a YouTube video, as part of an online course titled "Introduction to Finite Math", with a whiteboard explanation of state transitions in tennis games.  When visualizations are provided they are usually arranged like the GameFish in TennisVisuals.com, with either horizontal or vertical orientation:

As far as I can tell, the Game Tree design created by Damien Saunder and David Webb at GameSetMap.com is the first time a Sankey Diagram (or Harness Flow Map) was applied to Point Progression in Tennis. The primary innovation was to apply the idea of "quantitative flow lines" to the possible point paths through the tree such that the width of each line represents the frequency with which games passed through each possible "state" for the score, but the real power of the design comes from its interactive nature.  SVG (Scalable Vector Graphics) is used to:
  1. animate exploration of the data when it is filtered by selecting individual games, groups of games, or constraining the games to only service games for a chosen player
  2. provide contextual information when "hovering" over specific elements
The original implementation of Game Tree, presented as a celebration of Nadal's 2013 comeback, used match data downloaded in XML format from the William Hill Sports betting website.
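
The "possible point paths through the tree" follow directly from the scoring rules, which is where the Markov property mentioned earlier comes in: the current score alone determines which scores can come next. A minimal sketch, assuming a state is a `(server_points, receiver_points)` pair and that the game ends in a hypothetical "hold" or "break" state (none of these names come from either Game Tree implementation):

```python
def next_states(state):
    """Return the states reachable from `state` after one point.

    A state is (server_points, receiver_points) or one of the terminal
    strings "hold" and "break". Scores beyond 3-3 are folded back onto
    deuce (3, 3) or advantage (4, 3) / (3, 4).
    """
    if state in ("hold", "break"):
        return []  # terminal states have no successors
    s, r = state
    outcomes = []
    # First the server wins the point, then the receiver wins the point.
    for ns, nr in ((s + 1, r), (s, r + 1)):
        if ns >= 4 and ns - nr >= 2:
            outcomes.append("hold")
        elif nr >= 4 and nr - ns >= 2:
            outcomes.append("break")
        elif ns >= 3 and nr >= 3:
            low = min(ns, nr)
            outcomes.append((ns - low + 3, nr - low + 3))
        else:
            outcomes.append((ns, nr))
    return outcomes


print(next_states((0, 0)))  # [(1, 0), (0, 1)]
print(next_states((3, 3)))  # [(4, 3), (3, 4)]
print(next_states((4, 3)))  # ['hold', (3, 3)]
```

Folding long games back onto deuce and advantage keeps the number of states finite, which is what allows the tree to be drawn with a loop between those two nodes rather than an endless chain.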


In the TennisVisuals version of Game Tree, data is retrieved in JSON format from the Mongo database which underpins TennisVisuals.com.  That data, in turn, is presently sourced from Jeff Sackmann's Match Charting Project (many other data sources will come online soon).

The inspiration for the Game Tree design seems to have been the same frustration that drove the development of the Points-to-Set chart: the final score of a tennis match reveals very little about how close a match actually may have been. Even a match with a 6-0, 6-0 score may have been "hotly contested".  Traditional stats miss the story every time. Percentage of Points Won for a 6-0, 6-0 match, for instance, provides only a very crude view of match intensity - ranging from 100% for complete dominance by one player to 62.5% for a match in which every game reached Deuce once and only once - it relates nothing of the drama and is of very little use for constructive analysis.
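
The two ends of that range can be checked with a little arithmetic. In a game won "to love" the winner takes 4 of 4 points; in a game that reaches Deuce exactly once the score is 3-3 at Deuce and the winner then takes the next two points, for 5 points won out of 8 played.

```python
from fractions import Fraction

GAMES = 12  # a 6-0, 6-0 match

# Complete dominance: the winner takes all four points of every game.
dominant = Fraction(4 * GAMES, 4 * GAMES)

# Every game reaches Deuce exactly once: 3-3, then two straight points
# for the winner, so 5 points won out of 8 played in each game.
contested = Fraction(5 * GAMES, 8 * GAMES)

print(float(dominant) * 100, float(contested) * 100)  # 100.0 62.5
```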

In the following Game Tree visualization of Nadal's service games in a match against Wawrinka at the 2013 Madrid Masters, it is easy to see that Nadal won the first point of his service games 77.8% of the time.  When he did lose the first point of a service game, he won the second point 100% of the time.
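
Percentages of this kind, like the widths of the flow lines, come from counting how often games moved from one score to the next. The sketch below shows one way to do that; the 'S'/'R' encoding of points, the function names, and the nine sample games are all made up for illustration and are not the data from the match.

```python
from collections import Counter

POINT_NAMES = ["0", "15", "30", "40"]


def label(server, receiver):
    """Render an unfinished point score from the server's perspective."""
    if server >= 3 and receiver >= 3:
        if server == receiver:
            return "Deuce"
        return "Ad-In" if server > receiver else "Ad-Out"
    return f"{POINT_NAMES[server]}-{POINT_NAMES[receiver]}"


def transition_counts(games):
    """Count moves between score states; assumes every game string is valid.

    Each game is a string of 'S' (server won the point) and 'R' (receiver won).
    """
    links = Counter()
    for game in games:
        s = r = 0
        for point in game:
            before = label(s, r)
            if point == "S":
                s += 1
            else:
                r += 1
            finished = max(s, r) >= 4 and abs(s - r) >= 2
            after = ("Hold" if s > r else "Break") if finished else label(s, r)
            links[(before, after)] += 1
    return links


def share(links, before, after):
    """Percentage of visits to `before` that moved on to `after`."""
    total = sum(n for (b, _), n in links.items() if b == before)
    return 100.0 * links[(before, after)] / total if total else None


# Made-up sample: nine service games, two of which start with a lost point.
games = ["SSSS"] * 7 + ["RSSSS"] * 2
links = transition_counts(games)
print(round(share(links, "0-0", "15-0"), 1))  # 77.8
print(share(links, "0-15", "15-15"))          # 100.0
```

The raw counts in `links` would set the width of each flow line, while `share` gives the conditional percentage at a node.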


With Game Tree it is possible to see how often Deuce was reached during a match; the thickness of flow lines even indicates how often game scores ricocheted between Deuce and Advantage.  In the match with Wawrinka, for games both served and received, Nadal lost only one game that reached Deuce:


In the Saunder/Webb implementation of Game Tree, the "Nodes" of the tree are color-coded to indicate momentum.  Dark nodes represent positive momentum while Red nodes represent negative momentum.  In the TennisVisuals version of Game Tree these representations still hold true, but momentum is always viewed from the perspective of the primary player; when filtering for the opponent's service game, the tree is not "flipped", as occurs in the Saunder/Webb version of the Nadal-Djokovic Roland Garros 2014 final.

The relative importance of each point in tennis games, sets, and matches has been analyzed extensively, most famously by Carl Morris in his article "The most important points in tennis", which was published in Optimal Strategies in Sports in 1977.   It is probably impossible to publish analysis of points in tennis without referencing Morris...  here is one of many studies, notable for its visualization of relative point importance within the context of a set:
In 2014, Professors Franc Klaassen and Jan Magnus provided ample coverage of the topic in their book "Analyzing Wimbledon".  Most recently Jeff Sackmann wrote a series of blog posts ("How Important is the First Point of Each Game?", "The Pivotal Point of 15-30") drawing on a theoretical model which he has published and utilizing his extensive match database.

My plan is to integrate the insights garnered from such analyses into the TennisVisuals version of Game Tree so that results for each match can be viewed in the context of benchmark figures. I'd like to auto-generate summary reports to go along with Game Tree visualizations, similar to what Saunder has done for the Nadal-Djokovic 2014 Roland Garros Final.  I also plan to divide each point "flow line" into errors and winners and highlight "clutch" performance.

Shortly after releasing the initial version of Game Tree, Saunder published a follow-up entitled "Where are you most likely to win a point on Nadal's serve?" In this article he introduced a "Proportional Symbol Game Tree" which shows the percent chance at every possible "state" of the score that an opponent had of winning the point.  It's enticing to think about using a similar Game Tree to visualize a player's service game performance in one match relative to their average over the course of the past year...  perhaps overlapped Proportional Symbols of reduced opacity...




Friday, September 18, 2015

Makeover

Today TAVA moved to a new domain: TennisVisuals.com
There are now proper Instructions for using the latest version of the interface; Examples, which had previously only been noted on Twitter, now have their own page.

Most of the changes I've been working on haven't yet surfaced, but the project has graduated from a hobby hosted on my brother's minimally configured server, which I was capable of crashing, to a true application having a real, and expandable, home in the cloud.

I will be posting new visual elements to the Examples page before they are integrated into the application.  The Examples page will also be a place to gather unique visualizations made by using the yet-to-be-published API, which will make a growing database of ATP and WTA matches generally available.

Personal matches charted using ProTracker Tennis will not be made public.  If you use ProTracker Tennis, you can mail your matches to tennis.aip 'at' gmail.com.  You will receive a link by email to the TAVA visualization of your match.