Bug #51209 for IMDB-Film: Add support for fetching details about videos

Sun Nov 08 15:31:08 2009 gerph [...] gerph.org - Ticket created

Subject:

Add support for fetching details about videos

Hiya,

I've added a couple of new functions to support retrieving URLs for videos related to the movies. Unlike the other APIs, this isn't a single shot request, because the fetches require that new pages be fetched in order to locate the correct URL. The 'videogallery' page must be fetched to get the main information, which we can do relatively easily, but there may be multiple pages of information, so we may require multiple fetches for that. Once we've got the details from there - in particular the 'vi' identifier - we can then fetch the actual URL from the player page. The player page embeds a small section of Javascript which constructs the parameters for the flash player. We only want the URL, so we pick this out of the Javascript directly.

Thus the videos() API allows us to obtain:
    The title of the clip
    The type of clip that it is (eg Featurette, Interview, Clip, Trailer, etc)
    A possible description
    A time and/or date that it was posted.
    A video_id identifier we can use to fetch the clip (and which is also unique, but clients shouldn't rely on any particular content).

From this we can use the video_url() API to convert the video_id to a URL, which incurs another fetch.

I'm not sure if you think this is worth going in to the IMDB::Film module, but it seems to work well for me and has been relatively reliable so far.

Clips appear to be flash video, and I couldn't find a way of getting at the rating information on the films - there's some sort of 'mature content' warning that sometimes appears but I couldn't see how to find out if it applied, etc.
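
As a rough illustration of "picking the URL out of the Javascript", the sketch below applies the same pattern the attached patch uses (IMDbPlayer.playerKey) to a made-up page fragment. The sample markup, the example.invalid URL and the helper name extract_video_url are assumptions for illustration only; the real player page may look quite different.

```perl
use strict;
use warnings;

# Hypothetical fragment of a player page; not captured from IMDB.
my $page = <<'HTML';
<script type="text/javascript">
IMDbPlayer.playerType = "flash";
IMDbPlayer.playerKey = "http://video.example.invalid/clip.flv";
</script>
HTML

# Return the quoted http: URL assigned to IMDbPlayer.playerKey,
# or undef when the page contains no such assignment.
sub extract_video_url {
    my ($html) = @_;
    return $html =~ /IMDbPlayer\.playerKey = "(http:.*?)"/ ? $1 : undef;
}

my $url = extract_video_url($page);
print defined $url ? "$url\n" : "no URL found\n";
```

Returning undef on a miss lets the caller tell "no clip available" apart from a fetch that succeeded.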

Subject:

diff-Film042-videos.diff

*** v042/Film.pm	2009-11-08 19:43:13.000000000 +0000
--- Film.pm	2009-11-08 19:51:47.000000000 +0000
***************
*** 85,90 ****
--- 85,91 ----
                      _full_companies
                      _recommendation_movies
                      _plot_keywords
+                     _videos
                      full_plot_url
      );
  
***************
*** 549,554 ****
--- 550,708 ----
      return $self->{_full_companies};
  }
  
+ 
+ =item videos()
+ 
+ Retrieve references to the videos related to this movie. News items are implicitly filtered. Returns an array reference where each item has the following structure:
+ 
+     {
+         type           => 'trailer' | 'featurette' | ...,
+         video_id       => <unique imdb identifier for the video, for passing to video_url>,
+         title          => <short title>,
+         description    => <description of the item>,
+         previewimg_url => <url of a preview image>,
+         date           => 'YYYY-MM-DD',
+         time           => 'hh:mm' | undef
+     }
+ 
+     my @videos = @{ $film->videos() };
+ 
+ =cut
+ 
+ sub videos {
+     my CLASS_NAME $self = shift;
+ 
+     unless($self->{_videos}) {
+         my @result;
+         my $page;
+         my $pagenum = 1;
+         while (1)
+         {
+             $page = $self->_cache ? $self->_cacheObj->get($self->code . '_videos' . $pagenum) : undef;
+ 
+             unless($page) {
+                 my $url = "http://". $self->{host} . "/" . $self->{query} . $self->code . "/videogallery?page=$pagenum";
+                 $self->_show_message("URL for video gallery for page $pagenum is $url ...", 'DEBUG');
+ 
+                 $page = $self->_get_page_from_internet($url);
+                 $self->_cacheObj->set($self->code . '_videos' . $pagenum, $page, $self->_cache_exp) if $self->_cache;
+             }
+ 
+             my $parser = $self->_parser(FORCED, \$page);
+             my $tag;
+ 
+             if ($parser->get_tag('h1'))
+             {
+                 while (my $tag = $parser->get_tag('img')) {
+                     my $attribs = $tag->[1];
+                     if (defined $attribs->{class} &&
+                         $attribs->{class} eq 'video' &&
+                         defined $attribs->{viconst})
+                     {
+                         my %video = ( 'video_id' => $attribs->{viconst},
+                                       'description' => $attribs->{title},
+                                       'previewimg_url' => $attribs->{src},
+                                     );
+                         # The type of clip is only shown in the text in the preview image.
+                         # Fortunately IMDB construct this image on the fly from the URL
+                         # parameters, so we can just extract it from there.
+                         if (defined $attribs->{src} &&
+                             $attribs->{src} =~ /_ZA([^,]+)/)
+                         {
+                             $video{type} = lc($1);
+                         }
+                         else
+                         {
+                             $video{type} = 'unknown';
+                         }
+ 
+                         # I have never seen a 'News' item which was actually relevant.
+                         next if ($video{type} eq 'news');
+ 
+                         if ($parser->get_tag('h2') &&
+                             $parser->get_tag('a'))
+                         {
+                             $video{title} = $parser->get_trimmed_text;
+                         }
+                         if ($parser->get_tag('br'))
+                         {
+                             my $release = $parser->get_trimmed_text;
+                             $video{date} = $1 if ($release =~ /(\d\d\d\d-\d\d-\d\d)/);
+                             $video{time} = $1 if ($release =~ /(\d\d:\d\d)/);
+                         }
+ 
+                         push @result, \%video;
+                     }
+                 }
+             }
+ 
+             if ($page =~ /\d+ – (\d+) of (\d+) Videos/)
+             {
+                 if ($1 == $2)
+                 {
+                     # All pages fetched
+                     $self->_show_message("All video gallery video processed ($2 videos)", 'DEBUG');
+                     last;
+                 }
+                 else
+                 {
+                     $self->_show_message("Processed video gallery page $pagenum, (up to video $1/$2); continuing...", 'DEBUG');
+                 }
+             }
+             else
+             {
+                 $self->_show_message("No other video gallery pages exist", 'DEBUG');
+                 last;
+             }
+             $pagenum++;
+         }
+ 
+         $self->{_videos} = \@result;
+     }
+ 
+     return $self->{_videos};
+ }
+ 
+ 
+ =item video_url($video_id)
+ 
+ Retrieve the URL for a video, given its video_id, or undef if one could not be found.
+ 
+ Example to obtain the URL of first trailer of a film:
+ 
+     my @videos = @{ $film->videos() };
+     @videos = grep { $_->{type} eq 'trailer' } @videos;
+     my $url = (@videos > 0) ? $film->video_url($videos[0]->{video_id}) : undef;
+ 
+ =cut
+ 
+ sub video_url {
+     my CLASS_NAME $self = shift;
+     my ($vi) = (@_);
+ 
+     return undef if (!defined $vi ||
+                      $vi !~ /^vi\d+$/);
+ 
+     my $page;
+     $page = $self->_cacheObj->get($self->code . '_' . $vi) if $self->_cache;
+ 
+     unless($page) {
+         my $url = "http://". $self->{host} . "/video/imdb/" . $vi . "/player";
+         $self->_show_message("URL for video '$vi' is $url ...", 'DEBUG');
+ 
+         $page = $self->_get_page_from_internet($url);
+         $self->_cacheObj->set($self->code.'_'.$vi, $page, $self->_cache_exp) if $self->_cache;
+     }
+ 
+     if ($page =~ /IMDbPlayer\.playerKey = "(http:.*?)"/)
+     {
+         # The Javascript contained the URL to use.
+         return $1;
+     }
+ 
+     return undef;
+ }
+ 
  =item company()
  
  Returns an company given for a specified movie:
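
The clip type handling in the patch can be tried in isolation. The sketch below reuses the /_ZA([^,]+)/ pattern and the 'news' filter from videos(); the preview image URLs and the helper name clip_type are invented for illustration, and the real parameter layout of IMDB's image URLs may differ.

```perl
use strict;
use warnings;

# Derive the clip type from a preview image URL, lower-cased;
# 'unknown' when the URL is missing or carries no _ZA parameter.
sub clip_type {
    my ($src) = @_;
    return 'unknown' unless defined $src && $src =~ /_ZA([^,]+)/;
    return lc $1;
}

# Hypothetical preview image URLs (assumed shape, not real IMDB URLs).
my @previews = (
    'http://img.example.invalid/a._V1._ZATrailer,4,76,16,97,16.jpg',
    'http://img.example.invalid/b._V1._ZANews,4,76,16,97,16.jpg',
    'http://img.example.invalid/c.jpg',
);

# Drop 'news' items, as the patch does, and keep everything else.
my @kept = grep { $_ ne 'news' } map { clip_type($_) } @previews;
print join(', ', @kept), "\n";
```

Falling back to 'unknown' rather than skipping the item keeps clips visible even if the image URL format changes.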

Mon Dec 14 05:02:12 2009 stepanov.michael [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #51209] Add support for fetching details about videos
Date:	Mon, 14 Dec 2009 12:01:54 +0200
To:	bug-IMDB-Film [...] rt.cpan.org
From:	Michael Stepanov <stepanov.michael [...] gmail.com>

Hi Justin,

Thanks for your patch. I'll review it and add it to the next release.

On Sun, Nov 8, 2009 at 10:31 PM, Justin Fletcher via RT <bug-IMDB-Film@rt.cpan.org> wrote:

> Sun Nov 08 15:31:08 2009: Request 51209 was acted upon.
> Transaction: Ticket created by gerph
> Queue: IMDB-Film
> Subject: Add support for fetching details about videos
> Broken in: 0.42
> Severity: Wishlist
> Owner: Nobody
> Requestors: gerph@gerph.org
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=51209 >
>
>
> Hiya,
>
> I've added a couple of new functions to support retrieving URLs for
> videos related to the movies. Unlike the other APIs, this isn't a
> single shot request, because the fetches require that new pages be
> fetched in order to locate the correct URL. The 'videogallery' page
> must be fetched to get the main information, which we can do relatively
> easily, but there may be multiple pages of information, so we may
> require multiple fetches for that. Once we've got the details from
> there - in particular the 'vi' identifier - we can then fetch the
> actual URL from the player page. The player page embeds a small section
> of Javascript which constructs the parameters for the flash player. We
> only want the URL, so we pick this out of the Javascript directly.
>
> Thus the videos() API allows us to obtain:
>     The title of the clip
>     The type of clip that it is (eg Featurette, Interview, Clip,
>     Trailer, etc)
>     A possible description
>     A time and/or date that it was posted.
>     A video_id identifier we can use to fetch the clip (and which is
>     also unique, but clients shouldn't rely on any particular content).
>
> From this we can use the video_url() API to convert the video_id to a
> URL, which incurs another fetch.
>
> I'm not sure if you think this is worth going in to the IMDB::Film
> module, but it seems to work well for me and has been relatively
> reliable so far.
>
> Clips appear to be flash video, and I couldn't find a way of getting at
> the rating information on the films - there's some sort of 'mature
> content' warning that sometimes appears but I couldn't see how to find
> out if it applied, etc.
>

-- Best regards, Michael Stepanov, http://linuxmce.ru

Mon Dec 14 05:02:14 2009 The RT System itself - Status changed from 'new' to 'open'

Thu Dec 24 18:17:42 2009 gerph [...] gerph.org - Correspondence added

Just found a problem with this; the handler on IMDB's side isn't as nice for the videogallery as for the other pages:

http://us.imdb.com/title/tt239303/videogallery?page=1

is VASTLY different to:

http://us.imdb.com/title/tt0239303/videogallery?page=1

The former returns an insane number of results which are not related to the programme in question.

In the sub videos code, I've replaced $self->code in the url with:

(sprintf "%07d", $self->code)

which appears to do the job.

Sorry about that - I hadn't seen it until now whilst debugging things. During debug for that result I was seeing:

[DEBUG] Processed video gallery page 1, (up to video 30/76451); continuing... at ./filmdetails.pl line 371

... which is a little more than I expected really. It's possible that the while (1) should also be changed to something like 'while ($pagenum < 10)' so that if problems occur in other places we don't keep fetching many, many pages.

Sorry once again for that problem.
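
A minimal sketch of the two safeguards described above, zero-padding the title code and bounding the number of gallery pages. This is not the patched module code: the helper names gallery_url and gallery_pages, the caller-supplied $fetch code reference and the cap of 10 pages are all assumptions made for illustration, and the "N – M of T Videos" footer pattern is the one from the original patch.

```perl
use strict;
use warnings;

my $MAX_PAGES = 10;    # assumed cap, in the spirit of 'while ($pagenum < 10)'

# Build the gallery URL with the title code zero-padded to seven digits,
# so that 239303 becomes tt0239303.
sub gallery_url {
    my ($host, $code, $pagenum) = @_;
    return sprintf "http://%s/title/tt%07d/videogallery?page=%d",
        $host, $code, $pagenum;
}

# Collect gallery pages until the footer reports the last video or the
# cap is reached. $fetch is a hypothetical code reference that takes a
# URL and returns the page text (or undef on failure).
sub gallery_pages {
    my ($host, $code, $fetch) = @_;
    my @pages;
    for my $pagenum (1 .. $MAX_PAGES) {
        my $page = $fetch->(gallery_url($host, $code, $pagenum));
        last unless defined $page;
        push @pages, $page;
        # Continue only while the footer says more videos remain.
        last unless $page =~ /\d+ \x{2013} (\d+) of (\d+) Videos/ && $1 < $2;
    }
    return \@pages;
}

# Simulate the runaway case from the debug output: every page claims
# 76451 videos, so only the cap stops the loop.
my $calls = 0;
my $pages = gallery_pages('us.imdb.com', 239303,
    sub { $calls++; return "1 \x{2013} 30 of 76451 Videos" });
print "fetched $calls pages\n";
```

With the cap in place, a misbehaving gallery costs a bounded number of fetches instead of thousands.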

Mon Dec 28 08:40:23 2009 stepanov.michael [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #51209] Add support for fetching details about videos
Date:	Mon, 28 Dec 2009 15:40:07 +0200
To:	bug-IMDB-Film [...] rt.cpan.org
From:	Michael Stepanov <stepanov.michael [...] gmail.com>

Nice to see that you solved that problem :)

On Fri, Dec 25, 2009 at 1:17 AM, Justin Fletcher via RT <bug-IMDB-Film@rt.cpan.org> wrote:

> Queue: IMDB-Film
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=51209 >
>
> Just found a problem with this; the handler on IMDB's side isn't as
> nice for the videogallery as for the other pages:
>
> http://us.imdb.com/title/tt239303/videogallery?page=1
>
> is VASTLY different to:
>
> http://us.imdb.com/title/tt0239303/videogallery?page=1
>
> the former returns an insane number of results which are not related to
> the programme in question.
>
> In the sub videos code, I've replaced $self->code in the url with:
>
> (sprintf "%07d", $self->code)
>
> which appears to do the job.
>
> Sorry about that - I hadn't seen it until now whilst debugging things.
> during debug for that result I was seeing:
>
> [DEBUG] Processed video gallery page 1, (up to video 30/76451);
> continuing... at ./filmdetails.pl line 371
>
> ... which is a little more than I expected really. It's possible that
> the while (1) should also be changed to something like 'while ($pagenum
> < 10)' so that if problems occur in other places we don't keep fetching
> many, many pages.
>
> Sorry once again for that problem.
>

-- Best regards, Michael Stepanov, http://linuxmce.ru

Wed Sep 08 08:25:34 2010 STEPANOV [...] cpan.org - Taken