__del__( self )

Eaten by the Python.

Some More Videogreping With Python


In this post I give a minimalistic version of Sam Lavigne’s Videogrep and use it to promote world peace.

This week Sam Lavigne wrote a very entertaining blog post introducing Videogrep, a Python script that searches through dialog in videos (using the associated subtitles file), selects scenes (for instance all scenes containing a given word), and cuts together a new video.

The script on Github implements many tweaks and goodies (such as working on multiple files, identifying complex patterns, etc.). In this post I present the code for a minimal videogreper in Python and attempt to refine cuts to get scenes containing whole sentences or single words.

Getting started

A good place to find public-domain videos with subtitles is the White House channel on YouTube. In what follows I will be working on the 2012 State of the Union address:

To get both the video and the subtitles you can use youtube-dl in a terminal:

youtube-dl --write-srt --srt-lang en Zgfi7wnGZlE state.mp4

This downloads a video file state.mp4 and a text file state.en.srt containing the subtitles, formatted as follows:

1
00:00:00,166 --> 00:00:00,667
(applause)

2
00:00:00,667 --> 00:00:02,066
The President:
Thank you.

This file can be easily parsed in Python to get a list of elements of the form ([t_start,t_end], text_block):

import re  # module for regular expressions

def convert_time(timestring):
    """ Converts a 'hh:mm:ss,msec' string into seconds. """
    nums = list(map(float, re.findall(r'\d+', timestring)))
    return 3600*nums[0] + 60*nums[1] + nums[2] + nums[3]/1000

with open("state.en.srt") as f:
    lines = f.readlines()

times_texts = []
current_times, current_text = None, ""
for line in lines:
    times = re.findall(r"[0-9]+:[0-9]+:[0-9]+,[0-9]+", line)
    if times:
        current_times = list(map(convert_time, times))
    elif line == '\n':
        times_texts.append((current_times, current_text))
        current_times, current_text = None, ""
    elif current_times is not None:
        current_text = current_text + line.replace("\n", " ")

print(times_texts)
>>> [([0.166, 0.667], '(applause) '),
>>>  ([0.667, 2.066], 'The President: Thank you. ')
>>>   ... ]

A simple videogreper

Let us have a look at the most common words in the speech:

from collections import Counter

whole_text = " ".join([text for (time, text) in times_texts])
all_words = re.findall(r"\w+", whole_text)
counter = Counter([w.lower() for w in all_words if len(w) > 5])
print(counter.most_common(10))
>>> [('applause', 82), ('american', 35), ('america', 33), ('because', 25),
>>>  ('should', 24), ('energy', 23), ('people', 23), ('americans', 20),
>>>  ('country', 18), ('cheering', 15)]

Seems like the word “should” has been pronounced a lot. Let us find the times of all the subtitle blocks in which it appears:

cuts = [times for (times, text) in times_texts
        if re.search("should", text)]

Now we cut and put together all these scenes using MoviePy:

from moviepy.editor import VideoFileClip, concatenate_videoclips

video = VideoFileClip("state.mp4")

def assemble_cuts(cuts, outputfile):
    """ Concatenate the cuts and write them to a video file. """
    final = concatenate_videoclips([video.subclip(start, end)
                                    for (start, end) in cuts])
    final.write_videofile(outputfile)

assemble_cuts(cuts, "should.mp4")

Here is the result:

It is promising, but in some scenes we don’t get to know exactly what should be done, which is frustrating. In the next section we add a little content-awareness to get more relevant cuts.

Greping whole sentences

We now want to cut together all the sentences containing the word “should”. We first explore the whole text looking for sentences containing that word, then we find the subtitle blocks corresponding to the start and end of each sentence, and we cut the video file accordingly.

times, texts = zip(*times_texts)
txt_lengths = list(map(len, texts))  # length of each subtitle block
# Start position of each block in whole_text (the '+ i' accounts for
# the spaces inserted by " ".join when whole_text was built):
indices = [sum(txt_lengths[:i]) + i for i in range(len(texts))]

def find_times(position):
    """ Finds the (t_start, t_end) in the subtitles
        for a given character position in the whole text. """
    return times[max([i for i in range(len(indices))
                      if indices[i] <= position])]

# Regular expression matching all sentences containing 'should'
regexpr = r"([A-Z][^\.!?]*%s[^\.!?]*[\.!?])" % "should"

cuts = [(find_times(m.start())[0], find_times(m.end())[1])
        for m in re.finditer(regexpr, whole_text)]

assemble_cuts(cuts, "should_sentence.mp4")

It’s much better:

Note that with just a little more code you could achieve much more. In Videogrep the author uses the Python package pattern to look for advanced phrase constructions, such as all phrases of the form gerund-determiner-adjective-noun.
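For a flavor of what such phrase searches look like, here is a rough stdlib-only sketch that does not require pattern: the regex below, along with the names GERUND_PHRASE and find_phrases, is my own crude approximation, matching any "-ing" word followed by a determiner and one or two more words (real part-of-speech tagging is far more robust):

```python
import re

# Crude approximation of a gerund-determiner-(adjective)-noun search:
# a word ending in '-ing', a determiner, then one or two more words.
GERUND_PHRASE = re.compile(
    r"\b(\w+ing)\s+(the|a|an|this|that)\s+(\w+\s+)?(\w+)\b",
    re.IGNORECASE)

def find_phrases(text):
    """ Returns (start, end, phrase) for each rough match in the text. """
    return [(m.start(), m.end(), m.group(0))
            for m in GERUND_PHRASE.finditer(text)]

print(find_phrases("We are building a better future by giving "
                   "the young Americans real opportunities."))
```

The character spans returned by find_phrases could then be fed to find_times and assemble_cuts to cut the matching scenes, just as we did with whole sentences.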

Greping single words

Let us take a step in the other direction and see whether it is possible to automatically cut a scene containing exactly one word or expression, with as little as possible of the words around it. Consider the following subtitle block:

00:03:20,100 --> 00:03:23,100
We can do this.

We can roughly evaluate that the word “We” will be pronounced in the first quarter of the time span (from 3:20.1 to 3:20.85), “can” in the second quarter (from 3:20.85 to 3:21.6), etc. Following this reasoning, here is a function that finds the exact times using the relative position of the characters in the subtitles blocks:
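As a quick sanity check of this idea, here is a small character-proportional sketch (word_times is an illustrative helper, not part of the script; it maps each word's character span onto the time span, which is the same reasoning the function below applies):

```python
def word_times(t1, t2, text):
    """ Linearly map each word's character span in `text`
        onto the time span (t1, t2). """
    times, pos = [], 0
    rate = (t2 - t1) / len(text)  # seconds per character
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        times.append((word, t1 + start * rate, t1 + end * rate))
        pos = end
    return times

for word, s, e in word_times(200.1, 203.1, "We can do this."):
    print("%-5s %.2f -> %.2f" % (word, s, e))
```

Because the mapping is character-based rather than word-based, "We" ends up with slightly less than a quarter of the time span, which is fine for a rough cut.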

def find_word(word, padding=0.05):
    """ Finds all approximate (t_start, t_end) spans for a word, with
        a little padding (in seconds) on each side. Only the first
        occurrence in each subtitle block is considered. """
    matches = [re.search(word, text)
               for (t, text) in times_texts]
    return [(t1 + m.start() * (t2 - t1) / len(text) - padding,
             t1 + m.end() * (t2 - t1) / len(text) + padding)
            for m, ((t1, t2), text) in zip(matches, times_texts)
            if m is not None]

Let us try it on “Americans”:

assemble_cuts(find_word("Americans"), "americans.mp4")

At least some of the cuts worked properly. If we stick to frequently pronounced words, we can find at least one correct cut for each of them and assemble a whole sentence:

words = ["Americans", "must", "develop", "open ", "source",
          " software", "for the", " rest ", "of the world",
          "instead of", " soldiers"]
numbers = [3, 0, 4, 3, 4, 0, 1, 2, 0, 1, 0]  # which cut to keep for each word

cuts = [find_word(word)[n] for (word,n) in zip(words, numbers)]
assemble_cuts(cuts, "fake_speech.mp4")

Wow! That seemed so real, and it almost made sense. From there the cuts could be refined by hand, but the script did most of the work, and surely deserved, if not a Nobel Peace Prize, fourteen minutes of applause:

cuts = [times for (times, text) in times_texts
        if re.search("applause", text)]
assemble_cuts(cuts, "applause.mp4")
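Out of curiosity, the total applause time can be tallied by summing the durations of the cuts. A quick sketch (the (t_start, t_end) pairs below are made up for illustration; with the real cuts list the same one-liner applies):

```python
# Sum the durations of the selected (t_start, t_end) cuts.
applause_cuts = [(0.166, 0.667), (12.0, 19.5), (100.0, 108.2)]
total_applause = sum(end - start for (start, end) in applause_cuts)
print("%.1f seconds of applause" % total_applause)
```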
