API docs¶
compare50¶
-
class
compare50.Comparator¶ Abstract base class for
compare50 comparators, which specify how submissions should be scored and compared.
-
abstract
compare(scores, ignored_files)¶ Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of
compare50.Comparisons
-
abstract
score(submissions, archive_submissions, ignored_files)¶ Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of
compare50.Scores for each submission pair.
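The two-stage shape described above (a cheap score for every pair, then an in-depth compare for the pairs worth a closer look) can be sketched without compare50 itself. Everything below is a hypothetical stand-in: ComparatorSketch, SharedLines and shared_lines are illustrative names, submissions are plain strings rather than compare50.Submission objects, and pairing each submission with each archive submission is an assumption rather than documented behaviour.

```python
from abc import ABC, abstractmethod
from itertools import combinations


class ComparatorSketch(ABC):
    """Hypothetical stand-in for the interface described above (not compare50 itself)."""

    @abstractmethod
    def score(self, submissions, archive_submissions, ignored_files):
        """Cheap stage: return one (sub_a, sub_b, score) tuple per pair."""

    @abstractmethod
    def compare(self, scores, ignored_files):
        """In-depth stage: return one detailed result per scored pair."""


def shared_lines(a, b, ignored_files):
    """Lines common to both texts, minus any line found in a distro text."""
    ignored = {line for text in ignored_files for line in text.splitlines()}
    return (set(a.splitlines()) & set(b.splitlines())) - ignored


class SharedLines(ComparatorSketch):
    """Toy comparator: the score is the number of shared, non-distro lines."""

    def score(self, submissions, archive_submissions, ignored_files):
        # Assumption: submissions are paired with each other and with archives.
        pairs = list(combinations(submissions, 2))
        pairs += [(s, a) for s in submissions for a in archive_submissions]
        return [(a, b, len(shared_lines(a, b, ignored_files))) for a, b in pairs]

    def compare(self, scores, ignored_files):
        # Recompute the shared lines in full for each pair that was scored.
        return [(a, b, sorted(shared_lines(a, b, ignored_files))) for a, b, _ in scores]


comparator = SharedLines()
scores = comparator.score(["a\nb\nc", "b\nc\nd"], ["c\nd\ne"], ["c"])
details = comparator.compare(scores, ["c"])
```

Splitting the work this way lets a caller score every pair cheaply and run the detailed comparison only on the highest-scoring ones.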
-
class
compare50.Comparison(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)¶ - Variables
sub_a – the first submission
sub_b – the second submission
span_matches – a list of pairs of matching
compare50.Spans, wherein the first element of each pair is from sub_a and the second is from sub_b.
ignored_spans – a list of
compare50.Spans which were ignored (e.g. because they matched distro files)
Represents an in-depth comparison of two submissions.
-
__init__(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING) → None¶ Method generated by attrs for class Comparison.
-
exception
compare50.Error¶ Base class for compare50 errors.
-
class
compare50.File(name, submission)¶ - Variables
name – file name (path relative to the submission path)
submission – submission containing this file
id – integer that uniquely identifies this file (files with the same path will always have the same id)
Represents a single file from a submission.
-
__init__(name, submission) → None¶ Method generated by attrs for class File.
-
classmethod
get(id)¶ Find File with given id
-
lexer()¶ Determine which Pygments lexer should be used.
-
property
path¶ The full path of the file
-
read(size=-1)¶ Open file, read
size bytes from it, then close it.
-
tokens()¶ Returns the preprocessed tokens of the file.
-
unprocessed_tokens()¶ Get the raw tokens of the file.
-
class
compare50.Pass¶ Abstract base class for
compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 receives) as well as a comparator (used to score and compare the preprocessed submissions).
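As a rough illustration of "a list of preprocessors plus a comparator", the sketch below chains token-to-token functions in list order. It does not import compare50: Token is a namedtuple stand-in with the documented fields, PassSketch, compose, drop_whitespace and lower_case are hypothetical names, and the token types are plain strings instead of Pygments token types.

```python
from collections import namedtuple
from functools import reduce

# Stand-in with the same fields as the documented Token class.
Token = namedtuple("Token", ["start", "end", "type", "val"])


def drop_whitespace(tokens):
    """Yield only tokens whose value contains a non-whitespace character."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


def lower_case(tokens):
    """Yield each token with its value lower-cased; offsets are unchanged."""
    for tok in tokens:
        yield tok._replace(val=tok.val.lower())


def compose(preprocessors):
    """Combine preprocessors into one function, applied in list order."""
    def run(tokens):
        return reduce(lambda toks, pre: pre(toks), preprocessors, tokens)
    return run


class PassSketch:
    """Hypothetical pass: a preprocessor list and a comparator slot."""
    preprocessors = [drop_whitespace, lower_case]
    comparator = None  # a real pass would supply a comparator here


tokens = [
    Token(0, 3, "Name", "Foo"),
    Token(3, 4, "Text", " "),
    Token(4, 7, "Name", "BAR"),
]
processed = list(compose(PassSketch.preprocessors)(tokens))
```

Because each preprocessor keeps the start and end offsets of the tokens it emits, matches found on the processed stream can still be mapped back to character positions in the original file.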
-
class
compare50.Score(sub_a, sub_b, score=0)¶ - Variables
sub_a – the first submission
sub_b – the second submission
score – a number indicating the similarity between
sub_a and sub_b (higher meaning more similar)
A score representing the similarity of two submissions.
-
__init__(sub_a, sub_b, score=0) → None¶ Method generated by attrs for class Score.
-
class
compare50.Span(file, start, end)¶ - Variables
file – the ID of the File containing the span
start – the character index of the first character in the span
end – the character index one past the end of the span
Represents a range of characters in a particular file.
-
__init__(file, start, end) → None¶ Method generated by attrs for class Span.
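Since end is one past the last character, a span behaves like a Python slice. The sketch below uses a namedtuple stand-in for Span (not the attrs class from compare50); span_text, span_length and overlaps are hypothetical helpers added for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as the documented Span class.
Span = namedtuple("Span", ["file", "start", "end"])


def span_text(content, span):
    """Return the characters covered by a half-open span."""
    return content[span.start:span.end]


def span_length(span):
    """Number of characters in the span."""
    return span.end - span.start


def overlaps(a, b):
    """True if two spans in the same file share at least one character."""
    return a.file == b.file and a.start < b.end and b.start < a.end


content = "int main(void)"
span = Span(file=0, start=4, end=8)
name = span_text(content, span)
```

With half-open ranges, two spans that merely touch (one ends where the next starts) do not overlap.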
-
class
compare50.Submission(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False)¶ - Variables
path – the file path of the submission
files – list of
compare50.File objects contained in the submission
preprocessor – A function from tokens to tokens that will be run on each file in the submission
id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).
Represents a single submission. Submissions may either be single files or directories containing many files.
-
__init__(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False) → None¶ Method generated by attrs for class Submission.
-
classmethod
get(id)¶ Retrieve submission corresponding to specified id
-
class
compare50.Token(start, end, type, val)¶ - Variables
start – the character index of the beginning of the token
end – the character index one past the end of the token
type – the Pygments token type
val – the string contents of the token
A result of the lexical analysis of a file. Preprocessors operate on Token streams.
-
__init__(start, end, type, val) → None¶ Method generated by attrs for class Token.
-
compare50.compare(scores, ignored_files, pass_)¶ - Parameters
scores ([
compare50.Score]) – scored submission pairs to be compared more granularly
ignored_files ({
compare50.File}) – files containing distro code
pass_ (
compare50.Pass) – pass whose comparator should be used to compare the submissions
- Returns
Compare50Results corresponding to each of the given scores
- Return type
[
compare50.Compare50Result]
Performs an in-depth comparison of each submission pair and returns a corresponding list of
compare50.Compare50Results.
-
compare50.expand(span_matches, tokens_a, tokens_b)¶ - Parameters
span_matches ([(
compare50.Span, compare50.Span)]) – span pairs to be expanded, wherein the first element of every pair is from the same file and the second element of every pair is from the same file
tokens_a ([
compare50.Token]) – the tokens of the file corresponding to the first element of each span_match
tokens_b ([
compare50.Token]) – the tokens of the file corresponding to the second element of each span_match
- Returns
A new list of maximally expanded span pairs
- Return type
Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
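The idea of expanding a match can be sketched for a single span pair as follows. This is a simplified illustration under stated assumptions, not the library's implementation: Token and Span are namedtuple stand-ins, tokenize and expand_one are hypothetical helpers, tokens are compared by value only, and each span is assumed to contain at least one whole token.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["start", "end", "type", "val"])
Span = namedtuple("Span", ["file", "start", "end"])


def tokenize(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return [Token(m.start(), m.end(), "Text", m.group())
            for m in re.finditer(r"\S+", text)]


def token_range(span, tokens):
    """Indices of the first and last token lying entirely inside the span."""
    inside = [i for i, t in enumerate(tokens)
              if t.start >= span.start and t.end <= span.end]
    return inside[0], inside[-1]


def expand_one(span_a, span_b, tokens_a, tokens_b):
    """Grow a matching span pair while the neighbouring tokens also match."""
    lo_a, hi_a = token_range(span_a, tokens_a)
    lo_b, hi_b = token_range(span_b, tokens_b)
    # Extend to the left while the preceding tokens are equal.
    while lo_a > 0 and lo_b > 0 and tokens_a[lo_a - 1].val == tokens_b[lo_b - 1].val:
        lo_a -= 1
        lo_b -= 1
    # Extend to the right while the following tokens are equal.
    while (hi_a + 1 < len(tokens_a) and hi_b + 1 < len(tokens_b)
           and tokens_a[hi_a + 1].val == tokens_b[hi_b + 1].val):
        hi_a += 1
        hi_b += 1
    return (Span(span_a.file, tokens_a[lo_a].start, tokens_a[hi_a].end),
            Span(span_b.file, tokens_b[lo_b].start, tokens_b[hi_b].end))


text_a = "x = a + b ;"
text_b = "zz = a + b"
# Start from a match on the single token "a" in each file.
expanded = expand_one(Span(0, 4, 5), Span(1, 5, 6), tokenize(text_a), tokenize(text_b))
```

Here the one-token match grows to cover `= a + b` in both files and stops at the tokens that differ (`x` versus `zz`) and at the end of the second file.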
-
compare50.missing_spans(file, original_tokens=None, processed_tokens=None)¶ - Parameters
file (
compare50.File) – file to be examined
original_tokens – the unprocessed tokens of
file. May optionally be specified if file has been tokenized elsewhere to avoid tokenizing it again.
processed_tokens – the result of preprocessing the tokens of
file. May optionally be specified if file has been preprocessed elsewhere to avoid doing so again.
- Returns
The spans of
file that were stripped by the preprocessor.
- Return type
Determine which parts of
file were stripped out by the preprocessor.
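One way to picture this computation is to walk the processed tokens in order and report every character range they leave uncovered. The sketch below is an assumption about the approach, not the library's code: it takes token lists and a file id directly instead of a compare50.File, and missing_spans_sketch is a hypothetical name.

```python
from collections import namedtuple

Token = namedtuple("Token", ["start", "end", "type", "val"])
Span = namedtuple("Span", ["file", "start", "end"])


def missing_spans_sketch(file_id, original_tokens, processed_tokens):
    """Return the character ranges covered by original but not processed tokens."""
    if not original_tokens:
        return []
    spans = []
    cursor = original_tokens[0].start   # first character not yet accounted for
    file_end = original_tokens[-1].end
    for tok in sorted(processed_tokens, key=lambda t: t.start):
        if tok.start > cursor:
            # Gap between the previous kept token and this one.
            spans.append(Span(file_id, cursor, tok.start))
        cursor = max(cursor, tok.end)
    if cursor < file_end:
        # Trailing characters removed after the last kept token.
        spans.append(Span(file_id, cursor, file_end))
    return spans


# Tokens for the text "a  # hi\nb"; the processed stream drops whitespace and the comment.
original = [
    Token(0, 1, "Name", "a"),
    Token(1, 3, "Text", "  "),
    Token(3, 7, "Comment", "# hi"),
    Token(7, 8, "Text", "\n"),
    Token(8, 9, "Name", "b"),
]
processed = [original[0], original[4]]
missing = missing_spans_sketch(0, original, processed)
```

Adjacent stripped tokens collapse into a single span, so the whitespace, comment and newline above are reported as one range.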
-
compare50.rank(submissions, archive_submissions, ignored_files, pass_, n=50)¶ - Parameters
submissions ([
compare50.Submission]) – submissions to be ranked
archive_submissions ([
compare50.Submission]) – archive submissions to be ranked
ignored_files ({
compare50.File}) – files containing distro code
pass_ (
compare50.Pass) – pass whose comparator should be used to rank the submissions
n (int) – number of submission pairs to return
- Returns
the top
n submission pairs
- Return type
Rank submissions, return the top
n most similar pairs
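Selecting the top n pairs from a list of scores can be sketched with the standard library. Score below is a namedtuple stand-in with the documented fields, and top_n is a hypothetical helper; how compare50 itself performs the selection is not stated here.

```python
import heapq
from collections import namedtuple

# Stand-in with the same fields as the documented Score class.
Score = namedtuple("Score", ["sub_a", "sub_b", "score"])


def top_n(scores, n=50):
    """Return the n highest-scoring pairs, most similar first."""
    return heapq.nlargest(n, scores, key=lambda s: s.score)


scores = [
    Score("alice", "bob", 12),
    Score("alice", "carol", 40),
    Score("bob", "carol", 3),
    Score("alice", "archive/2019", 27),
]
best = top_n(scores, n=2)
```

heapq.nlargest avoids sorting the full list when n is small relative to the number of pairs.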
compare50.passes¶
-
class
compare50.passes.exact¶ Removes nothing, not even whitespace, then uses the winnowing algorithm to compare submissions.
-
class
compare50.passes.misspellings¶ Compares comments for identically misspelled English words.
-
class
compare50.passes.nocomments¶ Removes comments, but keeps whitespace, then uses the winnowing algorithm to compare submissions.
-
class
compare50.passes.structure¶ Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.
-
class
compare50.passes.text¶ Removes whitespace, then uses the winnowing algorithm to compare submissions.
compare50.preprocessors¶
-
compare50.preprocessors.by_character(tokens)¶ Make a token for each character.
-
compare50.preprocessors.comments(tokens)¶ Remove all tokens that aren’t comments.
-
compare50.preprocessors.extract_identifiers(tokens)¶ Remove all tokens that don’t represent identifiers.
-
compare50.preprocessors.normalize_builtin_types(tokens)¶ Normalize builtin type names
-
compare50.preprocessors.normalize_case(tokens)¶ Make all tokens lower case.
-
compare50.preprocessors.normalize_identifiers(tokens)¶ Replace all identifiers with
v
-
compare50.preprocessors.normalize_numeric_literals(tokens)¶ Replace numeric literals with their types.
-
compare50.preprocessors.normalize_string_literals(tokens)¶ Replace string literals with empty strings.
-
compare50.preprocessors.split_on_whitespace(tokens)¶ Split values of tokens on whitespace into new tokens
-
compare50.preprocessors.strip_comments(tokens)¶ Remove all comments from tokens.
-
compare50.preprocessors.strip_whitespace(tokens)¶ Remove all whitespace from tokens.
-
compare50.preprocessors.text_printer(tokens)¶ Print token values. Useful for debugging.
-
compare50.preprocessors.token_printer(tokens)¶ Print each token. Useful for debugging.
-
compare50.preprocessors.words(tokens)¶ Split tokens into tokens containing just one word.
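A preprocessor is a function from tokens to tokens, so one can be written as a generator. The two sketches below only imitate the descriptions of by_character and split_on_whitespace above. They are not the library's code: Token is a namedtuple stand-in, the _sketch names are hypothetical, and dropping the whitespace itself when splitting is an assumption.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["start", "end", "type", "val"])


def by_character_sketch(tokens):
    """Emit one single-character token per character of each input token."""
    for tok in tokens:
        for offset, char in enumerate(tok.val):
            yield Token(tok.start + offset, tok.start + offset + 1, tok.type, char)


def split_on_whitespace_sketch(tokens):
    """Split each token's value on whitespace, keeping character offsets accurate."""
    for tok in tokens:
        for match in re.finditer(r"\S+", tok.val):
            yield Token(tok.start + match.start(), tok.start + match.end(),
                        tok.type, match.group())


chars = list(by_character_sketch([Token(0, 2, "Name", "ab")]))
words = list(split_on_whitespace_sketch([Token(10, 21, "Comment", "hello world")]))
```

Both sketches recompute start and end for every token they emit, so results on the new tokens still point at the right characters of the original file.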
compare50.comparators¶
-
class
compare50.comparators.Misspellings(dictionary)¶ -
__init__(dictionary)¶ Initialize self. See help(type(self)) for accurate signature.
-
compare(scores, ignored_files)¶ Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of
compare50.Comparisons
-
score(submissions, archive_submissions, ignored_files)¶ Number of identically misspelled words.
-
class
compare50.comparators.Winnowing(k, t)¶ Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
- Parameters
k (int) – the noise threshold; any matching sequence of tokens shorter than this will be ignored
t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched
-
__init__(k, t)¶ Initialize self. See help(type(self)) for accurate signature.
-
compare(scores, ignored_files)¶ Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of
compare50.Comparisons
-
score(submissions, archive_submissions, ignored_files)¶ Number of matching k-grams.
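To make the roles of k and t concrete, here is a minimal sketch of winnowing fingerprint selection over a list of token strings. It is an illustration only, not the comparator's code: hash_kgrams, winnow and shared_fingerprints are hypothetical names, CRC32 stands in for whatever hash the library uses, and the window size w = t - k + 1 follows the cited paper.

```python
import zlib


def hash_kgrams(tokens, k):
    """Hash every run of k consecutive tokens (CRC32 is an arbitrary choice)."""
    return [zlib.crc32(" ".join(tokens[i:i + k]).encode())
            for i in range(len(tokens) - k + 1)]


def winnow(hashes, w):
    """Select the minimum hash of each window of w hashes, as (hash, position).

    On ties the rightmost minimum is chosen, so neighbouring windows tend to
    select the same fingerprint.
    """
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        smallest = min(window)
        j = max(idx for idx, h in enumerate(window) if h == smallest)
        fingerprints.add((smallest, i + j))
    return fingerprints


def shared_fingerprints(tokens_a, tokens_b, k, t):
    """Count fingerprint hashes that two token sequences have in common."""
    w = t - k + 1  # window size that yields the guarantee threshold t
    hashes_a = {h for h, _ in winnow(hash_kgrams(tokens_a, k), w)}
    hashes_b = {h for h, _ in winnow(hash_kgrams(tokens_b, k), w)}
    return len(hashes_a & hashes_b)


# Fingerprint selection on an example hash sequence with window size 4.
example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
selected = winnow(example, 4)

a = "the quick brown fox jumps over the lazy dog".split()
b = ["x", "y"] + a[2:8] + ["z"]  # shares a run of six tokens with a
common = shared_fingerprints(a, b, k=2, t=4)
```

Any shared run of at least t tokens produces at least one full window of identical hashes in both sequences, and the minimum of that window is selected in both, which is why common is at least 1 here.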